
Enable concurrent execution of Python functions #236

Closed
maiqbal11 opened this issue Nov 12, 2018 · 44 comments
Labels: func-stack: Python, P0 [P0] items : Ship blocking
Comments

@maiqbal11
Contributor

maiqbal11 commented Nov 12, 2018

I've managed to isolate an issue where a queue trigger function is unable to call an HTTP endpoint exposed by another function in the same app. This behavior manifests whether the functions are synchronous or asynchronous.

Repro steps

Provide the steps required to reproduce the problem:

  1. The offending code is here: QueueWithHttpCall.zip. There is a function called QueueTriggerPython which pulls items from a queue and makes a GET request to an endpoint exposed by the IntakeHttpTrigger function.
  2. Activate virtual environment and install requirements: pip install -r requirements.txt.
  3. Install extensions: func extensions install
  4. Configure local.settings.json to point to the storage account that you are using.
  5. Run func host start and then add an item to the configured queue.

Expected behavior

The queue trigger should be able to successfully call into the HTTP endpoint and return the correct status code as well as print out the log message to the console.

Actual behavior

The queue trigger is activated but hangs when trying to call the HTTP endpoint, getting stuck at the following point:

Executing 'Functions.IntakeHttpTrigger' (Reason='This function was programmatically called via the host APIs.', Id=411d2a47-71de-4ab1-b231-c99a9794a7cb)

This is after the point at which the host calls into the language worker to process the request.

Known workarounds

  1. Both QueueTriggerPython and IntakeHttpTrigger are synchronous: raise the number of workers in the thread pool in dispatcher.py to 2 (see the sketch after this list).

  2. Both QueueTriggerPython and IntakeHttpTrigger are asynchronous: no known workaround.
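
For context, here is a minimal standalone sketch (not the actual dispatcher.py code; all names are illustrative) of why a pool with a single worker thread deadlocks on a self-call while two threads do not:

import time
from concurrent.futures import ThreadPoolExecutor

# With max_workers=1 the outer task occupies the only thread while it waits
# on the inner task, which can then never be scheduled; with max_workers=2
# the inner task gets its own thread and the call completes.
executor = ThreadPoolExecutor(max_workers=2)  # deadlocks if set to 1

def inner():
    time.sleep(1)                    # stands in for handling the HTTP invocation
    return "inner done"

def outer():
    future = executor.submit(inner)  # stands in for the queue trigger calling
    return future.result()           # back into the same app and waiting

print(executor.submit(outer).result())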

Related information

For the sync case, the single worker thread appears to block on the outbound call and therefore never processes the request from the other function in the same app. This would explain why raising the number of thread workers to 2 lets the call go through (one thread for the queue trigger and one for the subsequent HTTP call it makes). The async case may be a manifestation of the same issue, since we execute on the main event loop rather than in a separate thread pool.

@maiqbal11
Contributor Author

\cc @asavaritayal @1st1 @elprans

@asavaritayal
Contributor

@elprans can you investigate this issue?

@asavaritayal
Contributor

Also adding @1st1 since this seems to be related to how we're handling the thread pool.

@elprans
Collaborator

elprans commented Nov 13, 2018

@maiqbal11 Your async case does not work because you are making a blocking HTTP request, which blocks the entire event loop. Use aiohttp to make self-requests and it will work.
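
For illustration, a minimal sketch of a non-blocking self-request with aiohttp (the binding name, logging, and local URL are assumptions, not code from the attached repro):

import logging

import aiohttp
import azure.functions as func

INTAKE_URL = "http://localhost:7071/api/IntakeHttpTrigger"  # placeholder endpoint

async def main(msg: func.QueueMessage) -> None:
    # The await points let the event loop keep serving other invocations,
    # unlike a blocking requests.get() call.
    async with aiohttp.ClientSession() as session:
        async with session.get(INTAKE_URL) as resp:
            body = await resp.text()
            logging.info("IntakeHttpTrigger returned %s: %s", resp.status, body[:100])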

As for the sync case, I don't think we can handle this safely. Increasing max_workers to a value greater than 1 opens a huge can of worms, since it makes user code essentially multi-threaded with all the consequences that has.

@maiqbal11
Contributor Author

@elprans Works as expected when using aiohttp. Thanks for the clarification! Since the sync case can't be handled safely, the recommendation would be to use async constructs when sending requests to the same function app.

@maiqbal11
Contributor Author

maiqbal11 commented Nov 14, 2018

@maiqbal11
Contributor Author

Re-opening, as there is pending discussion about potentially raising the number of workers in the thread pool executor.

@maiqbal11
Contributor Author

This is based on a conversation with @paulbatum, who had a few questions/ideas about changing the thread pool size. The basic question is whether we should trade off some of the safety guarantees of single-threaded user code in favor of allowing (potentially non-proficient) users to write sync code that can run more concurrently.

For more context, the issue in this thread was raised by an internal customer with concerns about latency in their calls. It turned out that this particular scenario did not work because there was only one worker in the thread pool, so requests from one function in the app to another could not be processed. Other customers are likely to hit this issue as well, expecting their code to run with some degree of multi-threading provided by us. Each of these could turn into a support case that we would need to tackle.

Some discussion questions based on this:

  1. What are the pitfalls that we run into when allowing multi-threaded code, and how likely are they for the average user? Allowing it would unblock basic scenarios like the one noted in this issue. @elprans, perhaps you can elaborate more on this.

  2. Can we expect concurrency from multi-threaded code even if we do not use async/await constructs? I ran an experiment where an HTTP endpoint responded with a 10-second delay to calls (made using the Python requests library) from a queue trigger. I ran it for a batch of 4 queue messages with max_workers=1 and max_workers=4. Both cases produced similar performance (~40 seconds in total). With max_workers=4, the expectation was that when requests.get(url) is called, the thread waiting for a response would yield control to another thread that has not yet sent its request. However, that does not seem to be the case: the threads execute in an essentially synchronous fashion, each waiting for the duration of its GET request before another thread gets to run. I've been unable to find a concrete answer on this, but it does look like threads cede control for certain operations (https://stackoverflow.com/questions/14765071/python-and-truly-concurrent-threads). Would really like to hear more thoughts/get clarity on this. (A minimal sketch follows this list.)
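
To make question 2 concrete, here is a minimal standalone sketch (outside Functions; timings are approximate) showing that threads overlap while blocked on I/O-style waits but not during pure-Python computation:

import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_task(_):
    # time.sleep releases the GIL, much like a socket wait inside requests.get();
    # four of these overlap across four threads (~2s total, not ~8s).
    time.sleep(2)

def cpu_bound_task(_):
    # Pure-Python computation holds the GIL, so four of these in four threads
    # take roughly as long as running them sequentially.
    sum(i * i for i in range(2_000_000))

for name, task in [("io-bound", io_bound_task), ("cpu-bound", cpu_bound_task)]:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(task, range(4)))
    print(name, round(time.perf_counter() - start, 1), "seconds")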

@brettcannon
Member

Since I was cc'ed, I'll say my opinion is to agree with Elvis: don't increase the number of workers, and push users toward async/await, both to keep the worker simpler and to help users not shoot themselves in the foot with threaded code.

@paulbatum
Member

Hey @brettcannon, thanks for weighing in. I am trying to figure out how to follow your advice while balancing it with the promises we make around Functions regarding dynamic scaling and effective utilization of our hardware. For example, I have concerns that many Python users will write their functions using synchronous APIs, and this will result in very poor per-instance throughput (e.g. the entire application instance is idle while waiting for a single outbound HTTP request), which will in turn cause us to scale out the application more aggressively, using additional hardware.

Basically what it boils down to is that we want functions to run the user code as efficiently as possible, and this applies for both well written code, and not-so-well written code. Now there are many mistakes that customers can make that we can't correct for, but that doesn't mean we should give up completely.

Can you help me to understand the downside of us allowing multithreaded execution of synchronous python functions within a single application instance? What are some examples of how customers could "shoot themselves in the foot"? I'd like to understand if these examples are unique to Python. In contrast, we allow C# functions to be written to execute synchronously and we rely on .NET threading to provide adequate performance for these scenarios. The stateless programming model for functions means that we can do this without really having to teach users the ins and outs of multithreaded programming.

@brettcannon
Member

No, there's nothing special here in regards to Python and threads. The only thing to be aware of is CPython's GIL means that CPU-bound code won't see any benefit through threading, only I/O-bound code.

@1st1
Collaborator

1st1 commented Nov 20, 2018

CPython's GIL means that CPU-bound code won't see any benefit through threading, only I/O-bound code.

Also because of the GIL Python libraries and types aren't always threadsafe. So I'd be extremely cautious to run all code in a multi-threaded mode by default (and that's why I implemented this restriction in the first place.)

@paulbatum
Member

Several other languages have APIs and surface area that are not threadsafe; C# and Java are both good examples. We don't force those to run single-threaded. We rely on customers to either stay in the stateless programming model (and not worry about threadsafety), or to be careful when they go outside the bounds of the stateless model (such as by using static variables).

I am not sure I understand why enforcing a single thread of execution is a good tradeoff in the case of Python. There are the hardware utilization concerns I mentioned above, and similarly, customers that choose to run functions on dedicated hardware (such as an App Service plan) are likely to open support tickets reporting poor performance. Tickets that require analyzing customer code to diagnose are typically expensive.

The number of support cases we've received from C# developers running into threadsafety issues is truly tiny. I think we've already had more cases about poor python performance due to the use of synchronous APIs that do IO.

Any more insights or examples you can share to help me understand your perspective? Do my concerns make sense to you?

@brettcannon
Member

Threading is just not as big a thing in the Python community as it is in C# and Java. The GIL is enough of a deterrent that most people simply don't bother. This means you can't rely on libraries not to stuff things into global state that will fall over badly in a multi-threaded situation, because their authors never cared about race conditions. I don't know how the worker runs multiple functions, but if you're sharing modules across workers then debugging will be tough, because the interactions won't come from the code in your function in a single execution but from some other function running simultaneously and modifying things in a strange way. Python has a lot of global state because people never think about this sort of thing. (It also ties into Python putting a priority on developer productivity, since threading is not exactly a good way to make yourself write better code 😉 .)

In my opinion, if increasing the number of workers is just a setting flip then I would try it with the current setting and see how users respond. If you say "use async for increased performance" and they still come back decrying the lack of threads then would it be difficult to increase it later on? Compare that to giving threads initially, users complaining about weird bugs in their functions, and then having to scale it back after people have put in the effort to try and make threads work. To me the former is improving things for users (if it comes to that), while the latter is walking back (if it comes to that).

But I'm not maintaining the service or dealing with users and this is all subjective so I unfortunately don't have a magical answer for you short of asking the community how much they want threads in the face of async being available and potential debugging difficulty (I know I personally will only be doing async workloads for scaling purposes 😁 ).

@paulbatum
Member

@brettcannon Thanks Brett, this helps. Following on a little from your point about what it might make sense to start with and what we could change later, I'm concerned that starting with single threaded mode will tie our hands somewhat in that we could not really switch the default to multiple threads at a later point in time, without the risk of suddenly breaking lots of code that was written without thread-safety in mind. You're right that we could later add some sort of opt-in setting that allows multithreaded execution but that won't help me get effective utilization of our hardware in the case of consumption (I can't guarantee that users will opt-in).

I guess one possibility we could consider that we haven't discussed yet is that we run multiple python worker processes. This is less efficient from a memory utilization perspective, but it would allow concurrency within a single machine without exposing the user to threadsafety issues.

@asavaritayal asavaritayal changed the title Queue triggered function fails to call into Http Trigger Enable concurrent execution of Python functions Jan 4, 2019
@asavaritayal asavaritayal modified the milestones: Active Questions, Backlog Jan 4, 2019
@asavaritayal asavaritayal added the P2 [P2] items : Not ship blocking label Jan 4, 2019
@asavaritayal asavaritayal added P0 [P0] items : Ship blocking and removed P2 [P2] items : Not ship blocking labels Feb 19, 2019
@polarapfel
Member

@ericdrobinson Thanks for your responses.

I think I am trying to understand whether multi-threading in Python is, in general, a legitimate way to introduce true parallelism to execution (from what I've read, the answer seems to be no), and if that's the case, why use multi-threading on Azure Functions within a single execution host other than to avoid blocking calls. I've compared how Azure Functions with JavaScript works, and the documentation for Node/JavaScript on Azure Functions specifically advises choosing single-vCPU App Service plans. My guess is that with Node/JavaScript there is also no true multi-threading; the illusion of concurrency is achieved through the event loop. There is then literally no benefit to using multiple cores, and parallelism is achieved on Azure Functions by letting Azure scale to any number of independent Function App invocations on as many single-vCore hosts as needed. The same probably applies to Python in that context?

As to the multiprocessing module within a function: assuming I do choose an App Service plan with a host SKU that has multiple cores, utilizing them with multi-threading won't work. Let's say I have a CPU-intensive task on a queue of items (that I want to batch) that benefits from parallel execution; each atomic task is complex enough that its execution is far more expensive than forking a process, and the tasks do not need to communicate with each other. In that case, I would gain by being able to use the multiprocessing module, right?

In the end, my ask comes down to this: the outcome of this ticket should be thorough guidance for Python developers on where to implement parallelism within a function and where to rely on the (auto-)scaling of Azure Functions. And as a second part, when parallelism is implemented within a function, detailed guidance as to which approaches work and which won't.

@ericdrobinson

@polarapfel Some followup responses:

My guess is, that with Node/Javascript, there also is no true multi-threading, the illusion of concurrency is achieved through the event loop.

The latest version of Node/JavaScript does support multithreading. Please see the Worker Threads module.

That said, JavaScript processing is a typically single-threaded thing. The asynchronous processing happens thanks to the Event Loop.

There is literally no benefit of using multiple cores then and parallelism is achieved on Azure Functions by letting Azure scale to any number of independent Function App invocations on as many single vcore hosts as needed.

Not really, no. You can do advanced things with NodeJS that would allow you to take advantage of multiple cores from within a single function instance. While I've not encountered the documentation you're referring to, my guess is that the general wisdom is that most common workloads that you perform with NodeJS have no need for multithreading and therefore more cores are simply wasted.

The same probably applies to Python in that same context?

To some extent, sure. The reasons for this are entirely different but multithreading and multiprocessing are viable options in Python. You just need to be judicious with which you invoke.

As to the multiprocessing module within a function: assuming I do choose an App Service plan with a host SKU that has multiple cores, utilizing them with multi-threading won't work.

That isn't quite true. It is true IF your workload is CPU-bound (e.g. lots of maths). However, IF your workload is IO-bound (e.g. lots of networking/file IO/etc.) then multithreading will serve you well. You just need to be very careful to ensure that your functions adhere to the model outlined in the documentation.

Let's say I have a CPU intensive task on a queue of items (that I want to batch) that benefits from parallel execution, each atomic task is complex enough that its execution is way more expensive than forking a process and tasks do not need to communicate with each other. In that case, I would gain by being able to use the multiprocessing module, right?

Yes. And you CAN use the multiprocessing module in an Azure function. Debugging such a setup in VSCode appears to be broken at present, but running it live does work (as I reported here).
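
For illustration, a minimal sketch of what that can look like in a Python Azure Function (the HTTP binding shape, the JSON body, and the work being parallelized are assumptions for the example):

import multiprocessing

import azure.functions as func

def transcode(item: str) -> str:
    # Placeholder for one CPU-heavy unit of work (e.g. decoding a file).
    return item.upper()

def main(req: func.HttpRequest) -> func.HttpResponse:
    items = req.get_json().get("items", [])  # assumes a JSON body like {"items": [...]}
    # Fan the CPU-bound work out across cores; each task should be expensive
    # enough to justify the cost of spawning worker processes.
    with multiprocessing.Pool() as pool:
        results = pool.map(transcode, items)
    return func.HttpResponse(", ".join(results))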

In the end, my ask comes down to this: the outcome of this ticket should be thorough guidance for Python developers where to implement parallelism within a function and where to rely on the (auto)-scaling of Azure Function.

I wholeheartedly agree! :D

And as a second part, when parallelism is implemented within a function, providing some detailed guidance as to which approaches work and which won't.

I agree a little less, maybe? Microsoft shouldn't have to educate people on how Multiprocessing/Threading in Python works. The Python documentation covers most of that. The guidance I would hope to see would be at the level of: "If you have a computationally-intensive task, structure your function like this. If you have an IO-intensive task, structure your function like that. Please see the Python documentation for more on these topics."

@polarapfel
Member

Hey @ericdrobinson,

Here is an example (taken from "Serious Python") of a CPU-intensive workload with no IO:

import random
import threading
results = []
def compute():
    results.append(sum(
            [random.randint(1, 100) for i in range(1000000)]))
workers = [threading.Thread(target=compute) for x in range(8)]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()
print("Results: %s" % results)

Running this on an idle CPU with 4 cores looks like this:

$ time python worker.py
Results: [50517927, 50496846, 50494093, 50503078, 50512047, 50482863, 50543387, 50511493]
python worker.py  13.04s user 2.11s system 129% cpu 11.662 total

This means that out of the 4 cores, only about 32 percent of the total capacity (129/400) was used.

The same workload rewritten for multiprocessing:

import multiprocessing
import random

def compute(n):
    return sum(
        [random.randint(1, 100) for i in range(1000000)])

# Start 8 workers
pool = multiprocessing.Pool(processes=8)
print("Results: %s" % pool.map(compute, range(8)))

Executed on the same idle CPU with 4 cores:

$ time python workermp.py
Results: [50495989, 50566997, 50474532, 50531418, 50522470, 50488087, 50498016, 50537899]
python workermp.py  16.53s user 0.12s system 363% cpu 4.581 total

It results in more than 90% CPU usage (363/400) and a 60% reduction in execution time.

To me, that is pretty compelling evidence that multi-threading does not help with parallel compute tasks in Python - even for CPU only tasks without any IO.

@ericdrobinson

To me, that is pretty compelling evidence that multi-threading does not help with parallel compute tasks in Python - even for CPU only tasks without any IO.

Sure. But what is your point, exactly? That the Python Function invocations themselves should be spawned into a Process pool rather than a Thread pool?

I'm asking because even with the current model, I believe that you can access those 4 cores from a single Python Function invocation with your own Process Pool implementation. Does that make sense?

@polarapfel
Member

Sure. But what is your point, exactly? That the Python Function invocations themselves should be spawned into a Process pool rather than a Thread pool?

I guess I have more questions than points here. :)

I'm asking because even with the current model, I believe that you can access those 4 cores from a single Python Function invocation with your own Process Pool implementation. Does that make sense?

I guess it goes back to your point about documentation: the expectation can be that Python developers already know about parallel computing when it comes to generic Python language and runtime features, but while that can be the expectation, does that mean it should be?

Take these two excellent articles on the subject for example:

Data and chunk sizes matter when using multiprocessing.Pool.map() in Python
In Python, choose builtin process pools over custom process pools

These are educational in their own right. But applying that knowledge to writing a serverless workflow really depends. It depends on the nature of the processing work, queuing strategies, which sandbox limits are relevant, what App Service plan you're on, which host SKU you choose and the list goes on. I think we need detailed guidance in writing if we want the average developer to get the most out of Azure Functions.

@brettcannon
Member

@ericdrobinson @polarapfel As a Python core developer I can tell you that under CPython -- which is what Azure Functions runs under -- you will not gain anything from threads for a CPU-bound workload. The only benefit to threads under CPython is when I/O is blocking in a thread, allowing another thread to proceed (which is the same effect as using async/await except more explicitly).

@maiqbal11
Contributor Author

maiqbal11 commented Jul 2, 2019

Work completed. Pending documentation tracked here: #471.

@ericdrobinson

Can't wait to see that documentation!!

Also, a quick update on one of my last "reports":

It looks as though the ProcessPoolExecutor approach may not be the performance salve that I previously reported. In testing, when I run my function in a "worst case" context, it appears that the non-async version actually runs far faster than the async version.

It turns out this was caused by the processing I mentioned in step 3.ii of My Use Case. The "processing" that I do actually involves running ffmpeg to decode a file. The decoding itself is multi-threaded. Attempts to decode multiple files at the same time invoke multiple ffmpeg processes, each spawning multiple threads. This leads to an overall degradation of performance.

In my specific case, we "worked around" this limitation by simply restricting the "heavy processing"... process... to a single instance. I should note that I have yet to try the -threads option in FFMPEG to see if we get any gains by opening things up a bit.

Regardless, I'm very keen on seeing what work came of this!

@tomhosking

@ericdrobinson Thanks for all your investigation work on this!

Following on from your approach here, I had 2 questions:

  1. How were you able to use asyncio.get_running_loop() given that it was added in 3.7, but the current Azure worker is 3.6?

  2. Did you find a way of "re-attaching" logging from the other processes to the main Azure logger?

@ericdrobinson

@tomhosking Answers below:

How were you able to use asyncio.get_running_loop() given that it was added in 3.7, but the current Azure worker is 3.6?

I used the following:

loop = asyncio.get_event_loop()
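
For context, a minimal sketch of that pattern in full (offloading blocking work from a 3.6-compatible async function to a process pool; the function shape and helper names are assumptions):

import asyncio
from concurrent.futures import ProcessPoolExecutor

import azure.functions as func

def heavy_work(payload: str) -> str:
    # Placeholder for the CPU-heavy, blocking processing step.
    return payload[::-1]

async def main(req: func.HttpRequest) -> func.HttpResponse:
    loop = asyncio.get_event_loop()  # 3.6-compatible; 3.7+ has get_running_loop()
    with ProcessPoolExecutor() as pool:
        result = await loop.run_in_executor(pool, heavy_work, req.get_body().decode())
    return func.HttpResponse(result)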

Did you find a way of "re-attaching" logging from the other processes to the main Azure logger?

Unfortunately, no. If memory serves I might have been able to watch the standard output stream when testing the function on my local machine. I also recall creating a specific entry in the dictionary returned by the process that included any debug output (all functionality was wrapped in a try-except block just in case). That info would then get sent to the main process log for inspection. It was a kludgey workaround but it got the job done.
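
For reference, a rough sketch of that kind of workaround (names and structure assumed; not the original code):

import logging
import traceback
from concurrent.futures import ProcessPoolExecutor

def worker(payload):
    # Collect log lines locally; logging from the child process does not reach
    # the main Azure logger, so the lines are shipped back in the result dict.
    logs = []
    try:
        logs.append("processing %r" % (payload,))
        result = payload.upper()  # placeholder for the real work
        return {"ok": True, "result": result, "logs": logs}
    except Exception:
        logs.append(traceback.format_exc())
        return {"ok": False, "result": None, "logs": logs}

def run(payload):
    with ProcessPoolExecutor() as pool:
        outcome = pool.submit(worker, payload).result()
    for line in outcome["logs"]:
        logging.info(line)  # re-emit the child's output in the main process
    return outcome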

Would definitely love a better solution to this...

@tomhosking

Ah, makes sense! Thanks.

It seems that, as per this PR, setting FUNCTIONS_WORKER_PROCESS_COUNT > 1 does indeed enable multiple workers, and from very brief testing they seem to behave as I would expect (i.e. multiple requests handled, logging works).
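
For anyone trying this locally, the setting goes under Values in local.settings.json; a sketch with placeholder values (the count of 4 is just an example):

{
  "IsEncrypted": false,
  "Values": {
    "AzureWebJobsStorage": "<storage connection string>",
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "FUNCTIONS_WORKER_PROCESS_COUNT": "4"
  }
}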

@ericdrobinson

Ooooh, awesome! This entire approach may not even be necessary anymore!!

Will have to check it out at some point. Our limiting factor turned out to be ffmpeg eating as many threads as it could grab. As such there's little to be gained from trying to run multiple ffmpeg processes on the same CPU[set]...

@abkeble

abkeble commented Jul 12, 2019

We have been looking at this thread to find similar answers, and we were expecting to be able to run multiple calls to the same function in parallel. Simply setting the FUNCTIONS_WORKER_PROCESS_COUNT to the maximum of 10 still seems very limited as we would want to be able to scale to 100s of functions running in parallel.

@paulbatum
Member

@abkeble I think you've misunderstood what this setting does exactly so let me clarify. The short answer is that the Azure Functions platform can scale your app to run tens of thousands of concurrent executions.

The design of how the Functions infrastructure calls into your Python process to run your code absolutely allows for concurrent executions within a single Python process. However, if the code is written using APIs that block, then concurrent execution within that process is prevented. This thread has been about how we can make the system naturally use available resources even in cases where users write this type of code.

The approach we've come up with is to provide support for running multiple Python worker processes on a single VM. It's a setting that you can use directly today, but in the future we'll make the system smart enough to adjust dynamically. If you have an IO-bound workload and write good async Python functions, it's wasteful to run multiple Python processes, so in that case the system would stick with just one process per VM. We've aggressively set a low limit on this setting (currently 10) because it's very easy to hit the natural memory limits of Azure Functions as you create additional separate Python processes.

Our scale out architecture is based on running your function across many VMs. Your functions can scale to hundreds of VMs to get the needed level of concurrency.

@ericdrobinson

Your functions can scale to hundreds of VMs to get the needed level of concurrency.

@paulbatum No, they actually can't. At least today. Azure Functions for Python is still in Preview (see the Note at the top of this page).

Last time I checked/read, we are currently limited to 2 concurrent VMs. The maximum number of concurrent processes, then, is 20.

Any word on how much longer AFfP will be in Preview? Is it getting close, at least?

@paulbatum
Member

@ericdrobinson For the linux consumption plan, we have a deployment in progress that will increase the limit on concurrent VMs to 20. Rough ETA for global deployment of this change is 7/22. We expect to continue to increase this limit in subsequent deployments. I can't provide any specifics around when the python offering will exit preview, but yes, we are getting much closer.

@ericdrobinson

@paulbatum That's excellent news! Excited to see it happen!

Will this perhaps coincide with a fix for #359?

@abkeble

abkeble commented Jul 25, 2019

@paulbatum I don't believe this has been released yet, is there any update on when we can expect this? Thanks!

@paulbatum
Member

@abkeble Are you referring to the change of how many VMs you can concurrently execute on? Because my understanding is that is now live everywhere. If you're not seeing the behavior you're expecting, can you file a new issue, include app name and timestamp, and then mention me or leave a link here?

@balag0

balag0 commented Jul 30, 2019

@abkeble How are you verifying the number of scale-out instances? If it is through App Insights, we have a known issue where App Insights always shows 1 server instance live: #359
The fix for that issue will begin deploying later this week.

If you can share the sitename and timestamps I can also make sure there are no other issues with the scale out.

@abkeble

abkeble commented Jul 31, 2019

@balag0 @paulbatum We were looking at the App Insights value, which is what misled us. We are seeing an increased number of VMs now, however it only looks like 3 or 4 in total, up from the 1 or 2 we were seeing previously.
@balag0 Would you be able to confirm this?
Timestamp: 2019-07-31 12:59:51.344727 BST
Sitename: https://tempnwg.azurewebsites.net/

We are testing the scale out by running a function that sleeps for 5 seconds. We run this function (named SlowFunction) 20 times concurrently and get batches of 3 or 4 responses every 5 to 6 seconds. This is what leads us to believe there are only 3 or 4 instances running, whereas we would expect the function calls to return in batches of 20 if the consumption plan had scaled out to 20 instances. We have set FUNCTIONS_WORKER_PROCESS_COUNT to 1 to simplify the testing.
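
For reference, a minimal sketch of the kind of test function described (the name SlowFunction and the 5-second sleep are from the comment; the HTTP trigger shape is assumed):

import time

import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Block for 5 seconds so that the batching of concurrent responses reveals
    # how many instances/processes are actually serving requests.
    time.sleep(5)
    return func.HttpResponse("SlowFunction done")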

@ericdrobinson

For anyone interested, the FUNCTIONS_WORKER_PROCESS_COUNT app setting is documented here.
