
ci: Fix possible OOM error Process completed with exit code 137 #409

Closed
akihironitta opened this issue Nov 26, 2020 · 19 comments

@akihironitta (Contributor) commented Nov 26, 2020

🐛 Bug

The CI full testing / pytest (ubuntu-20.04, *, *) jobs in particular tend to fail with the error:

/home/runner/work/_temp/5ef79e81-ccef-44a4-91a6-610886c324a6.sh: line 2:  1855 Killed                  coverage run --source pl_bolts -m pytest pl_bolts tests --exitfirst -v --junitxml=junit/test-results-Linux-3.7-latest.xml
Error: Process completed with exit code 137.

Example CI runs

This error might also happen on other OSes or versions; I haven't investigated yet.

To Reproduce

Not sure how to reproduce...

Additional context

Found while handling the dataset caching issue in #387 (comment).

@akihironitta added the fix (fixing issues...), help wanted (Extra attention is needed), and ci/cd (Continuous Integration and delivery) labels on Nov 26, 2020
@Borda (Member) commented Nov 26, 2020

Is it due to a timeout? How long does it run before the kill?

@akihironitta (Contributor Author)

The two runs were killed after 7m 43s and 7m 20s, so I guess it's not due to timeout...

@Borda (Member) commented Nov 27, 2020

> The two runs were killed after 7m 43s and 7m 20s, so I guess it's not due to timeout...

You are right, the CI timeout is 45 min, so it could be some random failure, or is it always the same test configuration?

@akihironitta (Contributor Author)

CI full testing / pytest (ubuntu-20.04, 3.8, latest) also fails in:

It seems to happen particularly on Ubuntu...(?)

@akihironitta (Contributor Author) commented Nov 28, 2020

It seems the following runs on Ubuntu were also unexpectedly killed, probably for the same reason:

I haven't checked many runs yet, but I've never seen Windows and macOS processes get killed...


The above three runs were killed while testing tests/models/test_scripts.py::test_cli_run_vision_image_gpt.

error log

tests/models/test_scripts.py::test_cli_run_log_regression[--max_epochs 1 --max_steps 2] PASSED [ 51%]
/home/runner/work/_temp/ad14f744-ecd1-4908-beaa-c3ef9af1bc7c.sh: line 2:  2831 Killed                  coverage run --source pl_bolts -m py.test pl_bolts tests -v --junitxml=junit/test-results-Linux-3.8-latest.xml
tests/models/test_scripts.py::test_cli_run_vision_image_gpt[--data_dir /home/runner/work/pytorch-lightning-bolts/pytorch-lightning-bolts/datasets --max_epochs 1 --max_steps 2] 
Error: Process completed with exit code 137.

@akihironitta (Contributor Author)

Could OOM be the reason why the processes were killed? See Stack Overflow - Process finished with exit code 137 in PyCharm.

I will check whether the Ubuntu runners have less memory than the macOS and Windows runners on GitHub Actions...
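
A quick, hedged way to check this is to print the runner's total RAM from a small Python snippet in a CI step. This is only a minimal sketch (the sysconf names are Linux-specific, so it applies to the Ubuntu runners):

import os

# total physical memory on the runner (Linux-only sysconf names)
total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
print(f"total RAM: {total_bytes / 1024 ** 2:.0f} MiB")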

@akihironitta (Contributor Author) commented Nov 28, 2020

According to the docs, all environments have the same hardware resources, but something might be occupying a lot of memory particularly on Ubuntu:

2-core CPU
7 GB of RAM memory
14 GB of SSD disk space

GitHub Docs - Specifications for GitHub-hosted runners

@akihironitta (Contributor Author)

Some of the runs were killed while testing the following:

  • tests/models/self_supervised/test_models.py::test_moco
  • tests/models/self_supervised/test_scripts.py::test_cli_run_self_supervised_moco[--data_dir /home/runner/work/pytorch-lightning-bolts/pytorch-lightning-bolts/datasets --max_epochs 1 --max_steps 3 --fast_dev_run --batch_size 2]
  • tests/models/test_scripts.py::test_cli_run_vision_image_gpt

Does this imply that we should lower batch_size and/or other parameters?

@akihironitta (Contributor Author)

I am getting more confident that out-of-memory is the cause of the kills, having checked memory usage with this script in akihironitta#1.

I observed the following output from https://github.com/akihironitta/pytorch-lightning-bolts/pull/1/checks?check_run_id=1468220929, which indicates that almost all memory was consumed while the tests were running (although the run was successful):

      date     time           total        used        free      shared  buff/cache   available
2020-11-28 21:05:53            6954         508        2774           3        3671        6145
2020-11-28 21:05:55            6954         517        2763           3        3673        6137
...
2020-11-28 21:13:48            6954        6768         106           1          80           3
2020-11-28 21:13:49            6954        6760         113           1          80          10
2020-11-28 21:13:51            6954        6778         104           1          70          53
2020-11-28 21:13:52            6954        6774         106           0          73           1
2020-11-28 21:13:53            6954        6771         104           0          77           1
...

[in MiB]
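
For reference, here is a minimal sketch of a memory logger in the same spirit (not the actual script from akihironitta#1): it parses /proc/meminfo and prints a timestamped total/used/available line in MiB every two seconds while the tests run.

import datetime
import time

def meminfo_mib():
    """Parse /proc/meminfo and return values in MiB (Linux only)."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0]) // 1024  # kB -> MiB
    return info

while True:
    m = meminfo_mib()
    stamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    used = m["MemTotal"] - m["MemAvailable"]
    print(f"{stamp}  total={m['MemTotal']}  used~={used}  available={m['MemAvailable']}")
    time.sleep(2)

Running something like this in the background (e.g. python memlog.py & before invoking pytest; filename illustrative) makes the CI log show how much memory is left right before a kill.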

@akihironitta (Contributor Author) commented Dec 1, 2020

When I increased batch_size in akihironitta#1, all the runs failed with the same error:

Error: Process completed with exit code 137.

which probably implies that the available memory isn't enough for Bolts' tests.

So, I think the options we have are:

  1. increase the memory size of CI
  2. use smaller network architectures
  3. use a smaller batch_size (I tried to reduce the batch size from 2 to 1, but since some models use batch normalization, the minimum required batch size is 2, which was confirmed in [wip] ci: Reduce memory usage to avoid unexpected kills #411; see the sketch below)
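
To illustrate the constraint in option 3, here is a minimal sketch in plain PyTorch (not Bolts code): batch norm in training mode needs more than one sample per channel to compute batch statistics, so batch_size=1 raises an error while batch_size=2 works.

import torch
from torch import nn

bn = nn.BatchNorm1d(4)
bn.train()

try:
    bn(torch.randn(1, 4))  # batch_size=1 -> cannot compute batch statistics
except ValueError as err:
    print(err)  # "Expected more than 1 value per channel when training ..."

out = bn(torch.randn(2, 4))  # batch_size=2 is the minimum that works
print(out.shape)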

@akihironitta (Contributor Author) commented Dec 5, 2020

@Borda Do you think we can increase the memory size or use smaller network architectures for all models?


It seems almost all the CI runs failed on the latest commit to the master branch due to the same error.

@Borda (Member) commented Dec 16, 2020

I do not think we can do much about the memory; rather, let's try to use a smaller model. Is it possible?

@akihironitta (Contributor Author)

@Borda For models with replaceable backbones, we can define smaller backbones and replace relatively big ones like ResNet with small ones. I'm not sure if that's really possible, but at least I can try (hopefully this year...).
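
For illustration, a rough sketch of what a smaller replaceable backbone could look like (the TinyEncoder name and the encoder= argument are hypothetical; the exact API differs per Bolts model):

import torch
from torch import nn

class TinyEncoder(nn.Module):
    """A deliberately tiny CNN encoder, only meant to keep CI memory low."""

    def __init__(self, out_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# in a test, something like the following (argument name is hypothetical):
# model = SomeBoltsModel(encoder=TinyEncoder())
print(TinyEncoder()(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 32])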

@Borda added this to the v0.3 milestone on Jan 18, 2021
@Borda modified the milestones: v0.3, v0.4 on Jan 22, 2021
@akihironitta changed the title from ci: Fix Error: Process completed with exit code 137 to ci: Fix possible OOM error Process completed with exit code 137 on Mar 30, 2021
@Borda (Member) commented Jul 6, 2021

I think that @awaelchli solved it for PL with a shared memory argument...
Not the case, as this issue was for GH Actions, not the Azure pipeline where we used docker images...

Let's reopen if it appears again...

@alexander-soare

Hi team, sorry to be snooping around. I googled this issue as I'm having the same problem working on the timm library (huggingface/pytorch-image-models#993). But I found that no individual test triggers the OOM, so it feels like it's something to do with memory not being released properly between tests. Wondering if you had similar experiences. Many thanks!

@connor-mccorm

Hi @alexander-soare, I was wondering if you had discovered anything regarding your comment above! I'm experiencing the same issue where no individual test triggers an OOM, and your theory about memory not being released properly seems like a viable explanation.

@alexander-soare

@connor-mccorm Unfortunately, we never figured it out. We just came up with band-aid solutions: running the tests in individual chunks or with multiprocessing.
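
For anyone landing here, a minimal sketch of those band-aids (assuming pytest; not code from this repo): an autouse fixture that forces garbage collection after every test, plus splitting the suite into separate pytest invocations so each process starts with a clean heap.

# conftest.py
import gc

import pytest

@pytest.fixture(autouse=True)
def collect_garbage_after_each_test():
    # runs after every test; forces Python to release unreachable objects sooner
    yield
    gc.collect()

Chunking then just means running e.g. pytest tests/models and pytest tests/datamodules as separate CI steps instead of a single pytest tests run (paths illustrative).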

@connor-mccorm

@alexander-soare good to know. Thanks for the information!

@ducanhle31


I'm having the same problem. How can I fix it?
