Many machines need process clean-up #770

Closed
adamfarley opened this issue Apr 12, 2019 · 11 comments

@adamfarley
Contributor

test-osuosl-ppc64le-ubuntu-16-04-1 had dozens of jdi processes left over from tests that didn't clean up properly after themselves.

An issue has been raised for this, and test-osuosl-ppc64le-ubuntu-16-04-1 is now clean, but I suspect test-osuosl-ppc64le-ubuntu-16-04-2 needs the same treatment.

I found that this command cleaned up most of them: pkill -f "openjdkbinary"

I recommend pausing the machine in Jenkins and running that command once nothing else is running on it.

You'll know it worked if running ps aux | grep java fills the screen beforehand and does not fill the screen afterwards.
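
For reference, a minimal sketch of that check / clean / check sequence, assuming (as on this machine) that the leftover JVMs all have "openjdkbinary" somewhere on their command line:

    # Before: count leftover test JVMs (the [j]ava trick excludes the grep itself)
    ps aux | grep "[j]ava" | wc -l

    # Kill anything whose command line mentions the openjdkbinary test workspace
    pkill -f "openjdkbinary"

    # After: the count should drop to just the Jenkins agent
    ps aux | grep "[j]ava" | wc -l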

Happy hunting!

@adamfarley adamfarley changed the title PPCLE machine needs process clean-up test-osuosl-ppc64le-ubuntu-16-04-2 needs process clean-up Apr 12, 2019
@karianna karianna added this to TODO in infrastructure via automation Apr 12, 2019
@karianna karianna added the bug label Apr 12, 2019
@karianna karianna added this to the 2019 April milestone Apr 12, 2019
@sxa
Member

sxa commented Apr 15, 2019

@smlambert Do you have anything in place to mitigate this? We could use a multi-configuration Jenkins job that runs periodically over all the machines and kills anything that looks hung. With only one Jenkins executor per machine it should be fairly easy to determine what has been left around (i.e. just about everything running as the jenkins user, apart from the Jenkins agent and the job that's doing the checking!)
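
As a rough illustration (a sketch only, not the actual job), such a check on each node might list everything the jenkins user is running while filtering out the agent itself; the remoting.jar pattern used below to identify the agent is an assumption:

    # List java processes owned by the jenkins user, excluding the agent
    # (assumed here to be identifiable by remoting.jar on its command line).
    ps -u jenkins -o pid,etime,args | grep java | grep -v remoting.jar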

@smlambert
Contributor

We should likely put a clean-up step in the setup stage of testing to look for and kill test-related processes (ones that may have been left hanging by previous test jobs that exited in a bad state).

I also agree that a separate job that cleans up stray processes and files would be a good, more thorough complement, as it can search for a broader range of processes to terminate. There is such a job in use at the OpenJ9 project; we can employ the same or a similar approach.
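
A hedged sketch of what such a setup-stage clean-up step might look like; the process patterns below are illustrative assumptions, not a definitive list:

    # Example setup-stage clean-up: kill stray processes left behind by earlier
    # test runs before this job starts. Run this from a script file so the
    # patterns do not appear on this shell's own command line.
    for pattern in "openjdkbinary" "jtreg"; do
        pkill -f "$pattern" || true   # pkill exits non-zero when nothing matches
    done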

@AdamBrousseau
Contributor

a job in use at the OpenJ9 project

https://github.com/eclipse/openj9/blob/master/buildenv/jenkins/jobs/infrastructure/Cleanup-Nodes
https://ci.eclipse.org/openj9/view/Infrastructure/job/Cleanup-Nodes/
https://ci.eclipse.org/openj9/view/Infrastructure/job/Sanitize-Nodes/

Note these are band-aid solutions, which I don't like doing. The job has the option to run cleanup, sanitize, or both. The Jenkins agent is also killed in the sanitize path, which we didn't like doing, but we had trouble killing processes properly without also killing the agent. Also, on Windows it does a full reboot instead because of Cygwin issues.

@sxa
Member

sxa commented Apr 15, 2019

@smlambert Quick and dirty check to identify the scope of the problem - the test machines in red here appear to have rogue java processes left around: https://ci.adoptopenjdk.net/view/work%20in%20progress/job/SXA-processCheck/

@smlambert
Contributor

Thanks @sxa555

Your shell script greps for java - does this exclude the Jenkins agent itself, which runs on all nodes?

I will edit your script to print the actual processes so we can better understand the root of the problem. I am curious whether the processes are all left over from OpenJDK tests or from other types of testing as well. If they come from OpenJDK test jobs, we should additionally be looking at why the underlying framework cannot or does not kill/clean them at the end of the run (and raising an issue against it if so).

@sxa
Member

sxa commented Apr 16, 2019

Your shell script greps for java - does this exclude the Jenkins agent itself, which runs on all nodes?

No - that one is explicitly grepped out, so in principle it would be safe to have an option on the job to kill all the processes it has detected ...

I will edit your script to print the actual processes

The job as-is will already show them in the console logs :-)

@sxa
Member

sxa commented Apr 16, 2019

I have removed the offline OpenLab CentOS machines to allow the jobs to complete, killing off runs 1 & 2, which were not completing because of this :-)

@smlambert
Contributor

Yes, I have looked more closely at the hung processes and it is very specifically the jdi tests on OpenJ9; we should disable those tests for now, as there appear to be multiple problems.

One major problem is that some of those tests expect to query a HotSpotDiagnosticMXBean.

It would be good for an OpenJDK contributor to review those tests to see what would be required to make them applicable to more than just HotSpot implementations.

@sxa
Member

sxa commented Apr 16, 2019

@adamfarley Can you adjust the title of this issue, please, now that we've seen that it does not appear to be specific to any one machine?

@adamfarley adamfarley changed the title test-osuosl-ppc64le-ubuntu-16-04-2 needs process clean-up Many machines need process clean-up Apr 16, 2019
@sxa sxa modified the milestones: 2019 April, 2019 May Apr 30, 2019
@sxa sxa modified the milestones: 2019 May, 2019 June May 28, 2019
@sxa
Member

sxa commented Aug 15, 2019

I have done a cleanup of most machines over the last couple of days, and we should try to keep the process check job clean now. I might move it into "production" state soon instead of work in progress.

@sxa sxa self-assigned this Nov 1, 2019
@sxa
Member

sxa commented Nov 1, 2019

This is being maintained reasonably well through the process cleanup job at https://ci.adoptopenjdk.net/view/Tooling/job/SXA-processCheck/configure, therefore closing for now.

@sxa sxa closed this as completed Nov 1, 2019
infrastructure automation moved this from TODO to Done Nov 1, 2019
@sxa sxa added this to the November 2019 milestone Nov 1, 2019