Memory issues due to requests not completing #124

jpswinski · 2022-04-28T15:46:31Z

When @SmithB ran YAPC processing requests over somewhat larger regions, the server's available memory plummeted and then either never recovered or recovered very slowly. As a result, clients saw multiple heavy-usage messages with retries. In many cases the servers never recovered and reset because they ran out of memory.

This was first observed on v1.4.0.

Here is a snippet that recreates the problem:

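from sliderule import icesat2

# Assumes the client has already been pointed at a server with icesat2.init(...)

# Region of interest as [longitude, latitude] pairs; the last point repeats the first to close the polygon
coordinates = [[-60.491806, -64.28095],
               [-56.623029, -69.028066],
               [-60.603624, -71.175799],
               [-64.825318, -69.704525],
               [-66.571939, -67.96301],
               [-63.715481, -65.673347],
               [-60.491806, -64.28095]]
poly = [{'lon': coo[0], 'lat': coo[1]} for coo in coordinates]

res = 20
cycle = 1

params = {'poly': poly,
          'cnf': 0,
          'len': res,
          'res': res,
          'ats': res / 2,
          'cnt': 10,
          'cycle': cycle,
          'maxi': 1,
          'yapc': {"score": 190, "knn": 0, "win_h": 3, "win_x": 15},
          'pass_invalid': False}

# Parallel ATL06 request over every ATL03 granule that intersects the polygon
D6 = icesat2.atl06p(params, asset='nsidc-s3', version="005")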


jpswinski · 2022-04-28T15:48:52Z

An upstream factor appears to be that many of the resources take an extremely long time to read. They do finish, but each read should complete in less than 10 seconds, and as the log shows, some of the resources do not contain an extraordinary number of segments once they are complete.

INFO:sliderule.sliderule:... continuing to read ATL03_20181117181102_07650112_005_01.h5 (after 2100 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181129174549_09480112_005_01.h5 (after 2000 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181207172913_10700112_005_01.h5 (after 1900 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181120182818_08110112_005_01.h5 (after 2100 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181211172053_11310112_005_01.h5 (after 1690 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181125175410_08870112_005_01.h5 (after 2040 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181018200027_03080112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181014200847_02470112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181104190139_05670112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181019193448_03230112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181031190954_05060112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:processed 18873 segments in ATL03_20181015194309_02620112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181027191811_04450112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181022195209_03690112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:processed 1462 segments in ATL03_20181023192629_03840112_005_01.h5 (after 2160 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181203173732_10090112_005_01.h5 (after 1950 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181216164654_12070112_005_01.h5 (after 1330 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181212165513_11460112_005_01.h5 (after 1650 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181215171233_11920112_005_01.h5 (after 1600 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181108185326_06280112_005_01.h5 (after 2150 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181121180237_08260112_005_01.h5 (after 2090 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181117181102_07650112_005_01.h5 (after 2110 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181129174549_09480112_005_01.h5 (after 2010 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181207172913_10700112_005_01.h5 (after 1910 seconds)
INFO:sliderule.sliderule:... continuing to read ATL03_20181120182818_08110112_005_01.h5 (after 2110 seconds)
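To see which resources exceed the expected 10 second read time, the client log can be summarized per resource. The following is a minimal sketch, not part of the SlideRule client: the function names `summarize` and `slow_resources` are hypothetical, and the regular expression assumes log lines shaped exactly like the two kinds quoted above.

```python
import re
from collections import defaultdict

# Matches the two client log line shapes quoted in this comment:
#   "... continuing to read <resource> (after N seconds)"
#   "processed M segments in <resource> (after N seconds)"
LOG_RE = re.compile(
    r"INFO:sliderule\.sliderule:"
    r"(?:\.\.\. continuing to read|processed (?P<segments>\d+) segments in) "
    r"(?P<resource>\S+\.h5) \(after (?P<seconds>\d+) seconds\)"
)


def summarize(lines):
    """Return {resource: {"seconds": longest elapsed time seen,
    "segments": segment count if the resource finished, else None}}."""
    summary = defaultdict(lambda: {"seconds": 0, "segments": None})
    for line in lines:
        match = LOG_RE.search(line)
        if match is None:
            continue  # not one of the two recognized line shapes
        entry = summary[match.group("resource")]
        entry["seconds"] = max(entry["seconds"], int(match.group("seconds")))
        if match.group("segments") is not None:
            entry["segments"] = int(match.group("segments"))
    return dict(summary)


def slow_resources(summary, threshold_s=10):
    """List (resource, seconds, segments) for resources slower than
    threshold_s, slowest first."""
    slow = [
        (resource, entry["seconds"], entry["segments"])
        for resource, entry in summary.items()
        if entry["seconds"] > threshold_s
    ]
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Passing the captured log lines to `summarize` and then to `slow_resources` gives one row per resource, which makes it easier to compare elapsed time against the number of segments actually processed.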

jpswinski · 2022-04-28T15:59:37Z

INFO:sliderule.sliderule:processed 2497 segments in ATL03_20181022195209_03690112_005_01.h5 (after 2900 seconds)

jpswinski · 2022-04-28T16:00:33Z

jpswinski · 2022-04-28T16:01:31Z

The long runs in the image above occur because the processing requests are still running, so their memory is not being freed. The lines that go back up occur when long processing requests finally finish and free their memory.

jpswinski · 2022-04-28T16:40:11Z

Big drops in available memory correspond to spikes in CPU usage on that same instance.

jpswinski · 2022-04-28T17:04:02Z

jpswinski · 2022-04-28T17:04:30Z

When I killed the client request, the server freed all of the memory held by the requests that were still being processed, and the available memory came back completely.

jpswinski · 2022-06-14T12:37:36Z

Two changes were made to dramatically improve the situation:

the YAPC algorithm version was updated
the client was enhanced so that only three requests per node can be outstanding
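The second change, capping outstanding requests per node, can be illustrated with a per-node semaphore. This is a minimal sketch of the general technique and not the SlideRule client's actual implementation: `run_bounded`, `send_request`, and the round-robin assignment of resources to nodes are all assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_OUTSTANDING_PER_NODE = 3  # the limit named in the comment above


def run_bounded(resources, nodes, send_request):
    """Call send_request(node, resource) for every resource while keeping
    at most MAX_OUTSTANDING_PER_NODE requests in flight on any one node.

    Resources are assigned to nodes round-robin (an assumption of this
    sketch). Results are returned in the same order as `resources`.
    """
    # One counting semaphore per node holds that node's free request slots.
    slots = {node: threading.Semaphore(MAX_OUTSTANDING_PER_NODE) for node in nodes}

    def worker(indexed_resource):
        index, resource = indexed_resource
        node = nodes[index % len(nodes)]
        with slots[node]:  # blocks while the node already has 3 in flight
            return send_request(node, resource)

    # Enough threads to fill every slot on every node, and no more.
    max_workers = MAX_OUTSTANDING_PER_NODE * len(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, enumerate(resources)))
```

Because a slow request holds its slot until it returns, a node that is stuck reading a resource accumulates at most three requests' worth of memory from this client instead of an unbounded backlog.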

jpswinski · 2022-06-14T12:38:33Z

Here are the updated stats from a recent run

jpswinski closed this as completed Jun 14, 2022

jpswinski transferred this issue from another repository Aug 3, 2022
