Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BlockingIOError on App Engine after some time #50

Closed
drazzib opened this issue Jun 22, 2020 · 29 comments
Closed

BlockingIOError on App Engine after some time #50

drazzib opened this issue Jun 22, 2020 · 29 comments

Comments

@drazzib
Copy link

drazzib commented Jun 22, 2020

After adding google-cloud-profiler==1.1.1 to my Python 3.7 App Engine project, I'm getting lots of

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

in my service stderr logs

I'm having the same issue either with App Engine Standard python3.7 or with App Engine Flex over a custom runtime (Docker image)

Is this a known issue, is there something I can do to mitigate this issue ?

@kalyanac
Copy link
Contributor

@drazzib Thank you for reporting this. This is not a known issue. We will try to reproduce this issue.

@kalyanac
Copy link
Contributor

Can you share your app.yaml, requirements.txt, Dockerfile? I tried with Quickstart apps on Profiling Python applications  |  Cloud Profiler  |  Google Cloud and did not see that messages.

@drazzib
Copy link
Author

drazzib commented Jun 23, 2020

app.yaml:

runtime: custom
env: flex
manual_scaling:
  instances: 1
resources:
  cpu: 4
  memory_gb: 6
  disk_size_gb: 20
env_variables:
  FLASK_APP: main.py
  FLASK_ENV: production

requirements.txt:

wrapt==1.11.2
sklearn==0.0
Flask==1.1.1
gunicorn==20.0.4
future==0.18.2
h5py==2.10.0
google-cloud-storage==1.26.0
google-cloud-core==1.3.0
google-cloud-profiler==1.1.1
google-cloud-logging==1.15.0
pandas==1.0.2
datalab==1.1.5
Keras==2.3.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
tensorflow-cpu==2.1.0
cachetools==4.1.0

Dockerfile:

FROM gcr.io/google-appengine/python

# Create a virtualenv for dependencies. This isolates these packages from
# system-level packages.
# Use -p python3 or -p python3.7 to select python version. Default is version 2.
RUN virtualenv -p python3 /env

# Setting these environment variables are the same as running
# source /env/bin/activate.
ENV VIRTUAL_ENV /env
ENV PATH /env/bin:$PATH

# Copy the application's requirements.txt and run pip to install all
# dependencies into the virtualenv.
ADD requirements.txt /app/requirements.txt
RUN pip install --upgrade pip
RUN pip install -r /app/requirements.txt

# Add the application source code.
ADD . /app

# Run a WSGI server to serve the application. gunicorn must be declared as
# a dependency in requirements.txt.
CMD gunicorn -c gunicorn.conf.py -b :$PORT main:app

@drazzib
Copy link
Author

drazzib commented Jun 24, 2020

Hi @kalyanac & @jqll !

I've just seen that a new 1.1.2 release has been pushed to PyPI repository which contains 17d58aa commit.

Is this relevant for issue discussed above ?

@kalyanac
Copy link
Contributor

Hi @drazzib, No the new release is not related to this issue. We are still having trouble reproducing this error.

@kalyanac
Copy link
Contributor

Are you seeing any profiles collected despite these messages?

@drazzib
Copy link
Author

drazzib commented Jun 24, 2020

Yes, I can confirm that - even with this error printed very often on stderr
there is still some collected profile visible on Google StackDriver Profiler UI

@kalyanac
Copy link
Contributor

Thank you for the response. Few more questions:

  • What profile types do you see in the Profiler UI?
  • For each profile type, select a 10 minutes in Timespan, and check the profile count shown in the root node. The root node has the profile count displayed as "... averaged over n profiles". Please report what the n is.
  • If the service is not running currently, please set the end time to a time when the service was running by clicking "Now" and setting the end time.

@kalyanac
Copy link
Contributor

kalyanac commented Jul 1, 2020

Gentle ping.

@drazzib
Copy link
Author

drazzib commented Jul 5, 2020

Hi @kalyanac , sorry for the delay.

  • In Profiler UI, I see both "CPU Time" et "Wall time" for my application
  • For each profile type, under a 10 minutes in Timespan, I can see
    • for "CPU Time" 20.91 ms (100%), averaged over 11 profiles
    • for "Wall Time" 10 s (100%), averaged over 9 profiles

@kalyanac
Copy link
Contributor

kalyanac commented Jul 6, 2020

Hi @drazzib, it seems like the agent is uploading profiles at a rate we expect it to. So we don't anticipate any loss of profile data at the moment.

We'll try to reproduce this issue and root cause the source of the message.

@radyz
Copy link

radyz commented Jul 6, 2020

I just started having this same issue today with my CloudRun service

Libs:
google-cloud-profiler = 1.1.2
gunicorn = 20.0.4
Django = 3

 Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/googlecloudprofiler/pythonprofiler.py", line 180, in _handler
    if not self._started or self._in_handler:
BlockingIOError: [Errno 11] Resource temporarily unavailable" 

@kalyanac
Copy link
Contributor

kalyanac commented Jul 6, 2020

@radyz thanks for reporting it.

@jqll
Copy link
Contributor

jqll commented Jul 6, 2020

Thanks for reporting the issue.

I think the problem is that the app enabled signal wakeup fd somewhere. When the profiler profiles, it triggers a SIGALRM/SIGPROF signal every 10ms. Since too many signals arrive too quickly, it may cause the fd’s buffer to be full, see https://docs.python.org/3/library/signal.html#signal.set_wakeup_fd:

There are two common ways to use this function. In both approaches, you use the fd to wake up when a signal arrives, but then they differ in how they determine which signal or signals have arrived.

In the first approach, we read the data out of the fd’s buffer, and the byte values give you the signal numbers. This is simple, but in rare cases it can run into a problem: generally the fd will have a limited amount of buffer space, and if too many signals arrive too quickly, then the buffer may become full, and some signals may be lost. If you use this approach, then you should set warn_on_full_buffer=True, which will at least cause a warning to be printed to stderr when signals are lost.

In the second approach, we use the wakeup fd only for wakeups, and ignore the actual byte values. In this case, all we care about is whether the fd’s buffer is empty or non-empty; a full buffer doesn’t indicate a problem at all. If you use this approach, then you should set warn_on_full_buffer=False, so that your users are not confused by spurious warning messages.

This will not affect the profiler. But if the app definitely needs to access the signal numbers as described in the first approach above, you probably should not use the profiler.

@mgrdcm
Copy link

mgrdcm commented Jul 9, 2020

I'm also having this same thing happen and have for years. GAE Standard, both Python 3.7 and Python 3.8, Django 3.0.8, default Gunicorn setup. Also tried a while ago to use uwsgi and same thing happens.

I also notice that I don't get very useful information from within my python app inside of Profiler, which I wonder if is related.

@kalyanac
Copy link
Contributor

kalyanac commented Jul 9, 2020

@mgrdcm Can you open a new issue about not getting useful information from Profiler? We would like to understand what is happening and help address it.

@mgrdcm
Copy link

mgrdcm commented Jul 9, 2020

Something on this that I've only noticed in the last 24 hours on GAE Standard is this:

Traceback (most recent call last): File "/layers/google.python.pip/pip/googlecloudprofiler/pythonprofiler.py", line 165, in _handler def _handler(self, unused_signum, frame): BlockingIOError: [Errno 11] Resource temporarily unavailable

Previously I was only getting the same block as mentioned in the more generic errors in the original post:

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

@jqll
Copy link
Contributor

jqll commented Sep 11, 2020

Hi @mgrdcm, sorry for the late response. When you see the new error below, does the profile you get contain "unknown" frames? If that's the case, could you share some setup so that we can reproduce the problem? Thanks!

BlockingIOError: [Errno 11] Resource temporarily unavailable
Exception ignored when trying to write to the signal wakeup fd:

@mgrdcm
Copy link

mgrdcm commented Dec 13, 2020

@jqll I apologies, I missed this message until now. I stopped using cloud profiler due to this and a few other reasons. Has anything changed on this issue since September? If it hasn't been diagnosed or resolved I'm happy to add it back in and see if I'm still seeing the problem!

@kalyanac
Copy link
Contributor

kalyanac commented Jan 7, 2021

We have documented this in our known issues section of the documentation.

Closing because there is no further action expected on this issue.

@kalyanac kalyanac closed this as completed Jan 7, 2021
@kalyanac kalyanac reopened this Jan 7, 2021
@kalyanac kalyanac closed this as completed Jan 7, 2021
@mgrdcm
Copy link

mgrdcm commented Jan 8, 2021

@kalyanac I'm running a pretty basic Django app on GAE Standard, and I just searched through all of my code and all the dependencies and nothing is calling set_wakeup_fd.

@kalyanac
Copy link
Contributor

kalyanac commented Jan 8, 2021

It is probably set by one of the Django or any other imports Django uses.

@kalyanac
Copy link
Contributor

kalyanac commented Jan 8, 2021

Here is an example from gunicorn

@mgrdcm
Copy link

mgrdcm commented Jan 8, 2021

Heh, I was just arriving at that conclusion - it's gunicorn itself that's doing it, which is the default server that Google App Engine Standard uses! None of the rest of the code I'm using in my project (directly or indirectly) does it.

So it looks like right now there's no way to use google-cloud-profiler reliably on App Engine Standard. Unless we switch to a different non-default server?

@mgrdcm
Copy link

mgrdcm commented Jan 9, 2021

@kalyanac Because of ☝️ meaning that you can't use one Google product (this package) on another (Google App Engine Standard with Python 3) without this happening, it kinda feels like a bug that should still be open. Maybe GAE offers a patched version of gunicorn? Or this package has a workaround for this case?

@kalyanac
Copy link
Contributor

@mgrdcm This is not specific to gunicorn, any package or user code could set the same.
I appreciate your point regarding App Engine. We will post an update if we figure out a way to address this.

@zackarydev
Copy link

Has anyone discovered a temporary fix for this other than disabling the profiler?

@wyk9787
Copy link
Contributor

wyk9787 commented Mar 9, 2021

Has anyone discovered a temporary fix for this other than disabling the profiler?

You can find the current solution here: https://cloud.google.com/profiler/docs/troubleshooting#python-blocking. I will also update the README to mention this.

@mgrdcm
Copy link

mgrdcm commented Mar 10, 2021

Not a fix, but I switched to using the New Relic agent on GAE Standard Python 3 and it works wonderfully. Their new free tier is pretty generous, and so far I'm able to stay within it for my use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants