Memory issues due to requests not completing #124

Closed
jpswinski opened this issue Apr 28, 2022 · 9 comments

Comments

@jpswinski
Member

When @SmithB ran YAPC processing requests over somewhat larger regions, the server's available memory plummeted and then either never recovered or recovered very slowly. As a result, clients saw multiple heavy-usage messages followed by retries. In many cases the servers never recovered and reset because they ran out of memory.

This was first observed on v1.4.0.

Here is a snippet that recreates the problem:

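from sliderule import icesat2

# Assumes icesat2.init(...) has already been called to point the client at a SlideRule server.

# Region of interest as [longitude, latitude] vertices; the last point repeats the first to close the polygon.
coordinates = [[-60.491806, -64.28095],
               [-56.623029, -69.028066],
               [-60.603624, -71.175799],
               [-64.825318, -69.704525],
               [-66.571939, -67.96301],
               [-63.715481, -65.673347],
               [-60.491806, -64.28095]]
poly = [{'lon': coo[0], 'lat': coo[1]} for coo in coordinates]

res = 20
cycle = 1

params = {'poly': poly,
          'cnf': 0,          # photon confidence level
          'len': res,        # segment length
          'res': res,        # step between successive segments
          'ats': res / 2,    # minimum along-track spread
          'cnt': 10,         # minimum photon count per segment
          'cycle': cycle,
          'maxi': 1,         # maximum fit iterations
          'yapc': {"score": 190, "knn": 0, "win_h": 3, "win_x": 15},
          'pass_invalid': False}

# Issue the parallel ATL06 processing request that triggers the memory problem.
D6 = icesat2.atl06p(params, asset='nsidc-s3', version="005")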
@jpswinski
Member Author

An upstream factor appears to be that many of the resources take an extremely long time to read. They do finish, but each should be read in under 10 seconds, and, as the log below shows, some of the resources do not contain an extraordinary number of segments once they complete.

INFO:sliderule.sliderule:... continuing to read ATL03_20181117181102_07650112_005_01.h5 (after 2100 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181129174549_09480112_005_01.h5 (after 2000 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181207172913_10700112_005_01.h5 (after 1900 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181120182818_08110112_005_01.h5 (after 2100 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181211172053_11310112_005_01.h5 (after 1690 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181125175410_08870112_005_01.h5 (after 2040 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181018200027_03080112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181014200847_02470112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181104190139_05670112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181019193448_03230112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181031190954_05060112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:processed 18873 segments in ATL03_20181015194309_02620112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181027191811_04450112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181022195209_03690112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:processed 1462 segments in ATL03_20181023192629_03840112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181203173732_10090112_005_01.h5 (after 1950 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181216164654_12070112_005_01.h5 (after 1330 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181212165513_11460112_005_01.h5 (after 1650 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181215171233_11920112_005_01.h5 (after 1600 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181108185326_06280112_005_01.h5 (after 2150 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181121180237_08260112_005_01.h5 (after 2090 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181117181102_07650112_005_01.h5 (after 2110 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181129174549_09480112_005_01.h5 (after 2010 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181207172913_10700112_005_01.h5 (after 1910 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181120182818_08110112_005_01.h5 (after 2110 seconds)
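To make the slow reads easier to spot, client log output like the above could be summarized per resource. The following is only a sketch: the regular expression is inferred from the two message shapes shown in this log, and `summarize`, `slow_resources`, and the 10-second default threshold are hypothetical helpers, not part of the SlideRule client.

```python
import re

# Matches the two message shapes seen in the log above (inferred, not an official format).
LOG_RE = re.compile(
    r"(?:continuing to read|processed (?P<segments>\d+) segments in) "
    r"(?P<resource>\S+\.h5) \(after (?P<seconds>\d+) seconds\)"
)


def summarize(lines):
    """Return {resource: (latest_elapsed_seconds, segments_or_None)}."""
    summary = {}
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # ignore unrelated log lines
        resource = match.group("resource")
        seconds = int(match.group("seconds"))
        segments = match.group("segments")
        prev_seconds, prev_segments = summary.get(resource, (0, None))
        summary[resource] = (
            max(prev_seconds, seconds),
            int(segments) if segments is not None else prev_segments,
        )
    return summary


def slow_resources(summary, limit_seconds=10):
    """Return the sorted names of resources whose elapsed time exceeds the limit."""
    return sorted(name for name, (secs, _) in summary.items() if secs > limit_seconds)
```

Feeding the captured log lines to `summarize` and then to `slow_resources` lists every granule that exceeded the expected read time, along with the segment count for those that have finished.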

@jpswinski
Member Author

INFO:sliderule.sliderule:processed 2497 segments in ATL03_20181022195209_03690112_005_01.h5 (after 2900 seconds)

@jpswinski
Member Author

[Image: memory_available]

@jpswinski
Member Author

The long runs in the image above occur because processing requests are still running, so their memory is not being freed. The lines go back up when long-running requests finally finish and release their memory.

@jpswinski
Member Author

Big drops in available memory correspond to spikes in CPU usage on that same instance.

@jpswinski
Member Author

[Image: ctrl-c-long-running-process]

@jpswinski
Member Author

When I killed the client request, the server freed all of the memory held by the requests that were still being processed, and the available memory came back in full.

@jpswinski
Member Author

Two changes were made to dramatically improve the situation:

  • the YAPC algorithm version was updated
  • the client was enhanced so that only three requests per node can be outstanding
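For illustration, capping the number of outstanding requests per node can be sketched with one semaphore per node. This is not the actual SlideRule client code; `PerNodeLimiter`, its `run` method, and the node names are hypothetical, and only the limit of three comes from the change described above.

```python
import threading


class PerNodeLimiter:
    """Allow at most `max_outstanding` concurrent requests per node (hypothetical helper)."""

    def __init__(self, max_outstanding=3):
        self._max = max_outstanding
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore(self, node):
        # Create each node's semaphore lazily, guarded so two threads cannot create it twice.
        with self._lock:
            if node not in self._semaphores:
                self._semaphores[node] = threading.BoundedSemaphore(self._max)
            return self._semaphores[node]

    def run(self, node, request_fn, *args, **kwargs):
        # Block until the node has a free slot, then issue the request and release the slot.
        with self._semaphore(node):
            return request_fn(*args, **kwargs)
```

Each worker thread would call `limiter.run(node, send_request, resource)` rather than calling `send_request` directly, so a fourth request to the same node waits until one of the three in flight completes. The intent is to keep a node from accumulating more in-flight work than its memory can hold.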

@jpswinski
Member Author

Here are the updated stats from a recent run:
[Image: big_yapc_stats]

@jpswinski jpswinski transferred this issue from another repository Aug 3, 2022