
ResourceWarning: unclosed socket error #106

Closed
slhowardESR opened this issue Jul 25, 2022 · 5 comments

@slhowardESR

Hi JP,

I am doing some testing in preparation for the large run, and sometimes I get this problem, which kills the program.

```
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19391b4c280>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19391b4c400>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x195403fc640>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
sys:1: ResourceWarning: unclosed socket <zmq.Socket(zmq.PUSH) at 0x19604735dc0>
ResourceWarning: Enable tracemalloc to get the object allocation traceback
Traceback (most recent call last):

  File D:\Jupyter\sliderulework\Spyder_SR\SR_by_RGT_5files_print_status.py:108 in <module>
    main()

  File D:\Jupyter\sliderulework\Spyder_SR\SR_by_RGT_5files_print_status.py:81 in main
    gdf = icesat2.atl06p(parmsyp, version=args.release,

  File d:\jupyter\sliderule-python\sliderule\icesat2.py:881 in atl06p
    return __parallelize(callback, __atl06, parm, resources, asset)

  File d:\jupyter\sliderule-python\sliderule\icesat2.py:597 in __parallelize
    result, resource = future.result()

  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\_base.py:437 in result
    return self.__get_result()

  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\_base.py:389 in __get_result
    raise self._exception

  File ~\anaconda3\envs\sliderule\lib\concurrent\futures\thread.py:57 in run
    result = self.fn(*self.args, **self.kwargs)

  File d:\jupyter\sliderule-python\sliderule\icesat2.py:459 in __atl06
    rsps = sliderule.source("atl06", rqst, stream=True)

  File d:\jupyter\sliderule-python\sliderule\sliderule.py:425 in source
    __clrserv(serv, stream)

  File d:\jupyter\sliderule-python\sliderule\sliderule.py:184 in __clrserv
    server_table[serv]["pending"] -= 1

KeyError: 'http://34.212.131.26'
```

I am not sure what is causing this. I can send you my code. I am basically trying to do SR-YAPC processing for individual RGTs in regions 10 and 12, and I am running the two regions as separate processes at the same time. Sometimes it works and sometimes it crashes.

Let me know if you need more info.

@jpswinski
Member

jpswinski commented Jul 25, 2022

@slhowardESR I've not seen this before. I wonder if the client is spawning too many threads and making too many concurrent connections. Can you send me the code you are using? If you want, you can send it to me via Slack so the code is kept private. I will run things on my side and see if I can recreate and diagnose the problem.

jpswinski added a commit that referenced this issue Jul 29, 2022
…concurrent threads to be removed, only the first wins and the others just pass
@jpswinski
Member

@slhowardESR after some investigation, there appear to be a number of different things happening in your processing runs:

  • The `ResourceWarning: unclosed socket <zmq...` is likely not a problem. It is a warning that the underlying code is not closing a socket it is no longer using. In this case it is coming from ZeroMQ, most likely via your Spyder environment.

  • The `KeyError` crash is due to a bug in the SlideRule Python client that is triggered by a race condition in the code. I fixed the bug and will be creating a new release of the code today. You can do a `git pull` on main or wait for the conda update. (A minimal illustration of the race appears after this list.)

  • The underlying problem, though, is that server nodes are going down because they run out of memory while processing your requests. From my testing, it looks like there are three granules (maybe more) in your test runs that take so much memory to process that the backend nodes handling them go down. There is a short term fix for this that will work most of the time, but the real long term fix is to rework the backend server code to be more memory efficient.

    • The short term fix is to add the line `sliderule.set_max_pending(1)` right after your call to `icesat2.init...`; you will also have to add `from sliderule import sliderule` to your imports at the top (see the sketch after this list). This call makes each backend server node process only one granule at a time; the default is three. Since YAPC takes almost 100% of the CPU, processing one granule at a time did not seem to affect the overall processing time that much.
    • The long term fix is more involved and so I have created another GitHub Issue to track its progress: Processing runs on full granules consume all resources (in some cases) sliderule#117
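
For readers hitting the same `KeyError`, here is a minimal, generic illustration of the race described in the second bullet. This is only a sketch, not the actual sliderule.py code; the names merely mirror the traceback above: several worker threads share a table of servers, one thread removes a failed server's entry, and another thread that still holds the key crashes when it tries to decrement the pending count.

```python
import threading

# Generic sketch of the race -- not the real client code; names mirror the traceback.
server_table = {"http://x.x.x.x": {"pending": 2}}
table_lock = threading.Lock()

def clrserv_unsafe(serv):
    # If another thread has already deleted server_table[serv],
    # this line raises KeyError, as in the traceback above.
    server_table[serv]["pending"] -= 1

def clrserv_guarded(serv):
    # One way to get "only the first wins and the others just pass":
    # lock the table and tolerate a missing entry.
    with table_lock:
        entry = server_table.get(serv)
        if entry is not None:
            entry["pending"] -= 1
```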
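
And a sketch of where the short term fix goes in a typical script. Only `sliderule.set_max_pending(1)` and the extra import are the fix itself; the server URL and request parameters below are placeholders standing in for whatever the existing workflow already uses.

```python
from sliderule import sliderule, icesat2

# Placeholder URL -- keep whatever you already pass to icesat2.init
icesat2.init("icesat2sliderule.org")

# Short term fix: each backend node works on one granule at a time (default is three)
sliderule.set_max_pending(1)

# Placeholder ATL06-SR parameters; keep your existing YAPC settings here
parms = {"srt": icesat2.SRT_LAND, "cnf": icesat2.CNF_SURFACE_HIGH}
gdf = icesat2.atl06p(parms)
```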

@jpswinski
Member

[snapshot attachment: susan_run_3_v_1]

@jpswinski
Member

The above snapshot shows a run with max pending set to 3, and then again with max pending set to 1. When set to 1, no servers bounce, though available memory still dips pretty low in places.

@jpswinski
Member

The temporary fix of setting the max pending to 1 seems to have worked well. All future development on this issue will be tracked under SlideRuleEarth/sliderule#117.
