Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Fix auto retries when out of memory. #1108

Merged
merged 3 commits into from Mar 1, 2019

Conversation

Gerhut
Copy link
Member

@Gerhut Gerhut commented Aug 20, 2018

Opened for CI.

@coveralls
Copy link

coveralls commented Aug 20, 2018

Coverage Status

Coverage increased (+0.01%) to 52.639% when pulling 1d67e9b on qixcheng/rest-server/disable-quota-oom-retries into dd37f6c on master.

Jenkinsfile Outdated
\\"command\\": \\"nodejs index 128\\",
\\"portList\\": []
}
]
}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Gerhut ,

Could you rebase on master and remove changes on Jenkinsfile?
You'd better test your fix on your own environment.

Jenkinsfile Outdated
\\"command\\": \\"nodejs index 128\\",
\\"portList\\": []
}
]
}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too.

@Gerhut
Copy link
Member Author

Gerhut commented Aug 31, 2018

Sure, I'll revert the Jenkins test before being merged

@@ -234,7 +242,7 @@ ln -s /tmp/pai-root/log/$APP_ID/$CONTAINER_ID/DockerContainerDebug.log $LAUNCHER
docker pull {{{ jobData.image }}} \
|| { echo "Can not pull Docker image"; exit 1; }
docker run --name $docker_name \
--rm \
--detach \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So docker container won't be removed automatically.
Where do you clean up stopped containers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 62 in exit_handler

@hao1939 hao1939 force-pushed the qixcheng/rest-server/disable-quota-oom-retries branch from 20e97d8 to 003fdc7 Compare August 31, 2018 09:44
@Gerhut Gerhut force-pushed the qixcheng/rest-server/disable-quota-oom-retries branch from 387d6e5 to 1b825af Compare September 6, 2018 10:01
@Gerhut Gerhut force-pushed the qixcheng/rest-server/disable-quota-oom-retries branch 2 times, most recently from 46a629e to a812230 Compare December 3, 2018 03:05
@yqwang-ms
Copy link
Member

You can also refer k8s launcher:
Docker OOMKilled may be due to exceed the Container memory limit or the Container workload spike or OS memory pressure, so it may be Permanent or Transient, Internal or External, so it should be an unknown failure which will count against maxretrycount.
See: https://github.com/Microsoft/frameworkcontroller/blob/8adcef25f6b00f2c903cba9fb85e8c69d3a02d98/pkg/apis/frameworkcontroller/v1/constants.go#L181

@Gerhut Gerhut force-pushed the qixcheng/rest-server/disable-quota-oom-retries branch from a812230 to d1f25e5 Compare February 25, 2019 06:17
Make all OOM cause exiting by 5
@fanyangCS
Copy link
Contributor

please treat it as higher priority.

@Gerhut Gerhut changed the title [WIP]Fix auto retries when out of memory of user quota. Fix auto retries when out of memory. Mar 1, 2019
@Gerhut Gerhut merged commit 953d655 into master Mar 1, 2019
@Gerhut Gerhut deleted the qixcheng/rest-server/disable-quota-oom-retries branch March 1, 2019 03:19

printf "[DEBUG] Write exit code $rc to file /var/lib/hadoopdata/nm-local-dir/nmPrivate/$APP_ID/$CONTAINER_ID/$CONTAINER_ID.pid.exitcode.\n"
docker container rm $docker_name
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Gerhut
We can't guarantee exit_handle will be executed, all codes here are besteffort. So the container might be left on the host.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, but there is an abnormal state when exit_handle does not be executed. The legacy containers removing should be rely on cleaner.

@zhiltsov-max
Copy link
Contributor

zhiltsov-max commented Mar 12, 2019

@Gerhut, hi, I'm working out on a solution for the same issue. I think there some unaccounted circumstances can occur, which we should handle there. Let's see.

  1. User task logs can be pretty big (50 MiB, for example), so I don't think printing the container logs in the exit handler is a good idea. They're also already should be printed in the stdout log file. Similarly, docker inspect - I don't think it is useful there.
  2. We are running the container with --oom-score-adj, meaning "if we have little free memory, kill something in this container", as the value is inherited by child processes. Thus, in theory, in the case of OOM any container child process can be killed, including the bootstrap script and the init process. In practice, however, the user task or its children gets killed more often (probably, as they use more resources and get higher priority). This means we have not only to check exited container OOM status, but also kernel log using dmesg or something else.
  3. BTW, dmesg has its own buffer, which not too big. It can overflow if, for example, the user task has spawned too many subprocesses - dmesg prints their memory usage, so for truly reliable output we probably should use something else, like system log. Anyway, we can expect this situation to be rare and stay with dmesg, hoping the killing log message is still there, as it probably has happened recently.
  4. exit_handler is executed before the script exits, and it can exit in several circumstances - including the system shutdown (TERM signal) and (correct me if I'm wrong) after the stop button in the web UI has been pressed. We should provide some time for "graceful stop" of the container, propagating the request to the user command. I'd suggest using docker stop there. We'll also require the some signal handling and propagation in the docker bootstrap script (I have an example of the implementation). Otherwise, the user can lost task results.
  5. As was said before, in some cases exit handler is not invoked. This is not a big deal if some kind of regular trash-reaping job is present. So, probably it's not a problem.
  6. The user command can expect OOM conditions for a one of the processes - probably, we can leave it on its own in this case as the task will probably be specifically prepared for this. Example: there is some not important subtask (to say, intermediate model testing), which is expensive, but can be skipped and should not fail the task. The user command can expect the failure and handle it properly. So, probably it's not a problem.
  7. I can't find debug_log function in the script, maybe it is a bug.

I suggest considering this variant:

function get_docker_container_id()
{
  docker inspect --format {{.Id}} $1
}

function check_docker_oom_killed()
{
  docker inspect --format {{.State.OOMKilled}} $1
}

function check_kernel_oom_killer()
{
  dmesg | grep -i "kill" | grep "/docker/$container_id"
}

function exit_handler()
{
  rc=$?
  printf "[DEBUG] EXIT signal received in yarn container, performing clean up action...\n"

  docker stop --time=${PAI_GRACEFUL_EXIT_TIMEOUT:-35} $docker_name >/dev/null

  # Here: container (i.e. init process) or the user task have been killed
  # 137, 143 - not handled TERM and KILL signals
  if [[ "$rc" = "137" || "$rc" = "143" || "$(check_docker_oom_killed $docker_name)" ]]; then
    if [ "$(check_docker_oom_killed $docker_name)" = "true" ] || \
         check_kernel_oom_killer $(get_docker_container_id $docker_name) >/dev/null; then \
      echo "[ERROR] One of the container processes has been killed by the OOM killer"
      rc=55 # Some undefined erroneous exit code
    fi
  fi

  docker rm -f $docker_name >/dev/null

  echo $rc > "/var/lib/hadoopdata/nm-local-dir/nmPrivate/$APP_ID/$CONTAINER_ID/$CONTAINER_ID.pid.exitcode"

  exit $rc
}

@fanyangCS
Copy link
Contributor

fanyangCS commented Mar 13, 2019

@zhiltsov-max , great suggestion! We also think of some points you raised previously (e.g., oom-score).
But overall we plan to let k8s to solve all the hassles, instead of hacking it one-by-one (see #2195 for details).

We plan to do some prototyping after this sprint (March release). Hence this PR is likely a short-term solution.

Meanwhile, we would love to hear your comments on #2195.

And last but not least, we welcome your contribution.

sunqinzheng added a commit that referenced this pull request Mar 20, 2019
* add a dashboard in grafana to list all tasks in node (#2197)

* Fix format in issue templates (#2233)

Fix format in issue templates:
- remove trailing spaces
- change chinese colon into english

* Fix auto retries when out of memory. (#1108)

* Distinguish cgroup OOM from dmesg.

* Remove cgroup OOM detection

Make all OOM cause exiting by 5

* Exit 55 when OOM

* Refine homepage for new users (#2155)

Updated first level bullets, to add more content for administrators and users, who is first time touch OpenPAI, or computing platform.

* Fix yarn container failed when docker container exited quickly. (#2256)

* REST server: remove expires in JWT payload of unit test (#2263)

* Deploy: add explicit config field in webportal  plugin (#2251)

* Deploy: add explicit config field in webportal  plugin

* Fix json.dumps

* t

* fix

* Update PLUGINS.md

* Update webportal.md

* alert on unhealthy gpu (#2209)

* Pylon: fix double start query in yarn redirect (#2258)

* Pylon: fix double start query in yarn redirect

* Hide debug info in docker-compose.yaml

* adapt user transfer script to new config (#2266)

* Webportal: add pai-version attribute to <pai-plugin> (#2245)

* Webportal: add pai-version attribute to <pai-plugin>

* Use preprocess to apply window.PAI_VERSION

* set version in layout.html

* Fix ib drivers bug (#2269)

* FIx ib installation script bug (#2271)

* [BUG] Fix hadoop ai build path (#2262)

* fix hadoop ai build bugs

* refine

* Web portal submit job: support init json from sessionStorage. (#2253)

* YARN and HDFS log persistence  (#2244)

* rm log persist

* change log dir to host

* persist nm log to host

* resolve conflict

* persist namenode log

* persist data node log

* add comments

* move log path to common pai storage

* use twisted in yarn-exporter (#2273)

* [Job Debugging] Basic Implement Of Job Debugging. (#2272)

* Refine document for new user to submit job (#2278)

1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.

* [Drivers] Fix the issue when installing IB drivers.  (#2275)

* fix can not report zombie process using gpu error (#2279)

* fix external process error

* add debug log

* fix short ID and long ID do not match

* use time based atomic ref to exchange info between threads

* add test case for AtomicRef

* fix bug in file remove (#2288)

* fix hadoop build error (#2296)

* export vc/node related metrics from yarn (#2289)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* fix lint error
sunqinzheng added a commit that referenced this pull request Mar 28, 2019
* add a dashboard in grafana to list all tasks in node (#2197)

* Fix format in issue templates (#2233)

Fix format in issue templates:
- remove trailing spaces
- change chinese colon into english

* Fix auto retries when out of memory. (#1108)

* Distinguish cgroup OOM from dmesg.

* Remove cgroup OOM detection

Make all OOM cause exiting by 5

* Exit 55 when OOM

* Refine homepage for new users (#2155)

Updated first level bullets, to add more content for administrators and users, who is first time touch OpenPAI, or computing platform.

* Fix yarn container failed when docker container exited quickly. (#2256)

* REST server: remove expires in JWT payload of unit test (#2263)

* Deploy: add explicit config field in webportal  plugin (#2251)

* Deploy: add explicit config field in webportal  plugin

* Fix json.dumps

* t

* fix

* Update PLUGINS.md

* Update webportal.md

* alert on unhealthy gpu (#2209)

* Pylon: fix double start query in yarn redirect (#2258)

* Pylon: fix double start query in yarn redirect

* Hide debug info in docker-compose.yaml

* adapt user transfer script to new config (#2266)

* Webportal: add pai-version attribute to <pai-plugin> (#2245)

* Webportal: add pai-version attribute to <pai-plugin>

* Use preprocess to apply window.PAI_VERSION

* set version in layout.html

* Fix ib drivers bug (#2269)

* FIx ib installation script bug (#2271)

* [BUG] Fix hadoop ai build path (#2262)

* fix hadoop ai build bugs

* refine

* Web portal submit job: support init json from sessionStorage. (#2253)

* YARN and HDFS log persistence  (#2244)

* rm log persist

* change log dir to host

* persist nm log to host

* resolve conflict

* persist namenode log

* persist data node log

* add comments

* move log path to common pai storage

* use twisted in yarn-exporter (#2273)

* [Job Debugging] Basic Implement Of Job Debugging. (#2272)

* Refine document for new user to submit job (#2278)

1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.

* [Drivers] Fix the issue when installing IB drivers.  (#2275)

* fix can not report zombie process using gpu error (#2279)

* fix external process error

* add debug log

* fix short ID and long ID do not match

* use time based atomic ref to exchange info between threads

* add test case for AtomicRef

* fix bug in file remove (#2288)

* fix hadoop build error (#2296)

* export vc/node related metrics from yarn (#2289)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* fix lint error
qfyin pushed a commit that referenced this pull request Apr 1, 2019
* add installation guide for VS code extension (#2223)

* add installation guide for VS code extension

* [VS Code] view container (#2301)

* add a dashboard in grafana to list all tasks in node (#2197)

* Fix format in issue templates (#2233)

Fix format in issue templates:
- remove trailing spaces
- change chinese colon into english

* Fix auto retries when out of memory. (#1108)

* Distinguish cgroup OOM from dmesg.

* Remove cgroup OOM detection

Make all OOM cause exiting by 5

* Exit 55 when OOM

* Refine homepage for new users (#2155)

Updated first level bullets, to add more content for administrators and users, who is first time touch OpenPAI, or computing platform.

* Fix yarn container failed when docker container exited quickly. (#2256)

* REST server: remove expires in JWT payload of unit test (#2263)

* Deploy: add explicit config field in webportal  plugin (#2251)

* Deploy: add explicit config field in webportal  plugin

* Fix json.dumps

* t

* fix

* Update PLUGINS.md

* Update webportal.md

* alert on unhealthy gpu (#2209)

* Pylon: fix double start query in yarn redirect (#2258)

* Pylon: fix double start query in yarn redirect

* Hide debug info in docker-compose.yaml

* adapt user transfer script to new config (#2266)

* Webportal: add pai-version attribute to <pai-plugin> (#2245)

* Webportal: add pai-version attribute to <pai-plugin>

* Use preprocess to apply window.PAI_VERSION

* set version in layout.html

* Fix ib drivers bug (#2269)

* FIx ib installation script bug (#2271)

* [BUG] Fix hadoop ai build path (#2262)

* fix hadoop ai build bugs

* refine

* Web portal submit job: support init json from sessionStorage. (#2253)

* YARN and HDFS log persistence  (#2244)

* rm log persist

* change log dir to host

* persist nm log to host

* resolve conflict

* persist namenode log

* persist data node log

* add comments

* move log path to common pai storage

* use twisted in yarn-exporter (#2273)

* [Job Debugging] Basic Implement Of Job Debugging. (#2272)

* Refine document for new user to submit job (#2278)

1. add new guidance to submit job for beginners.
2. refine homepage to connect with new guidance.
3. reorganize content of troubleshooting for next refactoring.
4. fix links in md files.

* [Drivers] Fix the issue when installing IB drivers.  (#2275)

* fix can not report zombie process using gpu error (#2279)

* fix external process error

* add debug log

* fix short ID and long ID do not match

* use time based atomic ref to exchange info between threads

* add test case for AtomicRef

* fix bug in file remove (#2288)

* fix hadoop build error (#2296)

* export vc/node related metrics from yarn (#2289)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* fix lint error

* [VS Code] default to generate jsonc job config file  (#2368)

* 720

* open hdfs explorer in view container
enable tslint rule "ordered-imports"

* add tslint rule for indent

* add home button to hdfs explorer's navigation;
adjust octicon's color

* fix lint error

* [VS Code] Add job list (#2160)

* add job list view to pai extension

* [VS Code] joblist fix (#2185)

* eager load recent jobs when job submitted

* avoid eager getChildren, and let vscode treeview.reveal do it implicitly

* default to generate jsonc job config file

* [VS Code] Refine error messages; Fix Cluster Explorer's bug

* [VS Code] changelog and readme (#2429)

* 539

* 712

* 452

* [VS Code] v0.11 compatible issue (#2457)

* 536

* 600

* [VS Code] fix cluster explorer's right-click menu (#2463)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants