Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extreme CPU load ~3100% during variable read from netcdf #264

Closed
durack1 opened this issue Aug 8, 2018 · 18 comments
Closed

Extreme CPU load ~3100% during variable read from netcdf #264

durack1 opened this issue Aug 8, 2018 · 18 comments
Assignees
Labels
Milestone

Comments

@durack1
Copy link
Member

durack1 commented Aug 8, 2018

I have noticed some extreme CPU usage during variable read from a netcdf file, an example below:

import cdms2 as cdm
f = '/work/durack1/Shared/obs_data/Argo/UCSD/180719_UCSD_monthly_TAndS_200401-201806_2p5-1975db.nc'
fH = cdm.open(f)
testAxes = fH('to')

And from top:

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
103217 durack1   20   0 7727m 5.1g  29m R 3100.0  2.3   7:48.84 anaconda2/envs/cdat80py2/bin/python -m spyder_kernels.console

Conda env info:

(cdat80py2) duro@ocean:[8800]> ls anaconda2/envs/cdat80py2/conda-meta/cdtime*
anaconda2/envs/cdat80py2/conda-meta/cdtime-3.0-py27h6091dcd_1.json
(cdat80py2) duro@ocean:[8801]> ls anaconda2/envs/cdat80py2/conda-meta/cdms2*
anaconda2/envs/cdat80py2/conda-meta/cdms2-3.0.1-py27h6091dcd_1.json

@dnadeau4 @zshaheen @doutriaux1

@zshaheen
Copy link
Contributor

zshaheen commented Aug 8, 2018

@durack1 Thanks for confirming what I've been seeing. I haven't tested 3.0.1 yet, but I'll try to confirm what you have.

In the meantime, @dnadeau4 and @doutriaux1 we need a test for this. It should query ps or top and check the resources used. That's not hard, but I have no idea how that would be ran in the run_tests.py that cdms is using. Cause I think that uses multiprocessing in and of itself.

@durack1
Copy link
Member Author

durack1 commented Aug 8, 2018

@zshaheen no problem, I've seen ~1200, ~1500 but the new record of 3100% today was too much not to list this issue.. Let me know if you want me to dig a little deeper into the libraries that comprise the cdat80py2 environment, it was created 1st August 2018 14:22 PDT for FYI

@dnadeau4
Copy link
Contributor

dnadeau4 commented Aug 8, 2018

@zshaheen thanks for doing this.

@zshaheen
Copy link
Contributor

zshaheen commented Aug 10, 2018

@durack1 @doutriaux1 @dnadeau4

I have horrible news. I just tested 3.0.1 of cdms and I'm getting the issue. I followed the instructions I had in #248.

With cdms like this, even running e3sm_diags with 4 processes spawns so many children processes that it can't even complete. The 12+ users of our (@chengzhuzhang and I) software are having similar issues.

@durack1
Copy link
Member Author

durack1 commented Aug 10, 2018

@zshaheen same hash as the build I am using above?

@zshaheen
Copy link
Contributor

zshaheen commented Aug 10, 2018

@durack1 Yes:

(e3sm_diags_env_1.3.5a) shaheen2@acme1:~/github/acme_diags/tests/system$ ls ~/anaconda2/envs/e3sm_diags_env_1.3.5a/conda-meta/cdms2*
/export/shaheen2/anaconda2/envs/e3sm_diags_env_1.3.5a/conda-meta/cdms2-3.0.1-py27h6091dcd_1.json
(e3sm_diags_env_1.3.5a) shaheen2@acme1:~/github/acme_diags/tests/system$ ls ~/anaconda2/envs/e3sm_diags_env_1.3.5a/conda-meta/cdtime*
/export/shaheen2/anaconda2/envs/e3sm_diags_env_1.3.5a/conda-meta/cdtime-3.0-py27h6091dcd_1.json

@durack1
Copy link
Member Author

durack1 commented Aug 10, 2018

Maybe a rebuild with careful compiler and library selection will fix things? I believe most machines I’ve seen issues are RHEL 6.10

@durack1
Copy link
Member Author

durack1 commented Aug 10, 2018

I haven’t checked Mac

@dnadeau4
Copy link
Contributor

dnadeau4 commented Aug 14, 2018

@zshaheen do you have a test?

@dnadeau4
Copy link
Contributor

dnadeau4 commented Aug 14, 2018

@durack1 this file does not exist.

'/work/durack1/Shared/obs_data/Argo/UCSD/180719_UCSD_monthly_TAndS_200401-201806_2p5-1975db.nc'

@zshaheen
Copy link
Contributor

zshaheen commented Aug 14, 2018

@dnadeau4 not an automated test, but you can manually follow these instructions. This should only take 1 min or so.

@durack1
Copy link
Member Author

durack1 commented Aug 14, 2018

@dnadeau4 this path was ocean, it's now on crunch

@dnadeau4
Copy link
Contributor

dnadeau4 commented Aug 14, 2018

#248

set these environment variables.
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1

@durack1
Copy link
Member Author

durack1 commented Aug 14, 2018

@dnadeau4 great, this fixes me for the issue above and likely another problem with me maxing out thread counts on some other cron jobs.

@pochedls - this will likely also effect you with the xml cron/spawning, my test case for a single thread operation created ~20 suspended threads, x40 for the xml scan (along with other processes) this will start getting close to the 1024 hard limit imposed for each user

@durack1
Copy link
Member Author

durack1 commented Aug 14, 2018

@gleckler1 @lee1043 @pochedls the env variables above #264 (comment) should be added in anytime you're going to start heavy (multithread) computing, I've been hitting problems for a number of months now, in my test this contains the issue

@durack1
Copy link
Member Author

durack1 commented Aug 15, 2018

@dnadeau4 thanks again for this, I've been testing all afternoon and the highest thread count I've seen is ~130%, and the scripts are running MUCH MUCH faster.. Thanks again.

@durack1
Copy link
Member Author

durack1 commented Aug 15, 2018

@dnadeau4 bad news, this problem is not fixed. I have set the envs:

(cdat80py2) duro@ocean:[180606_PaperPlots_UpperDeepWarming]:[9352]> env | grep NUM_THREADS
OPENBLAS_NUM_THREADS=1
OMP_NUM_THREADS=1

And it seems this is not controlling threads, with a new thread added every 3 secs that the script runs:

top - 11:22:25 up 20 days, 21:19, 34 users,  load average: 2.19, 2.23, 2.32
Tasks:  87 total,   1 running,  86 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.6%us,  1.3%sy,  0.0%ni, 96.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  231325120k total, 229465760k used,  1859360k free,  5111920k buffers
Swap:        0k total,        0k used,        0k free, 196023468k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                    
133509 durack1   20   0 32.7g 200m  23m R 88.4  0.1   4:10.45 python make_ohcFromArgo.py                 133551 durack1   20   0 32.7g 200m  23m S 42.2  0.1   0:01.76 python make_ohcFromArgo.py                 133515 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:06.27 python make_ohcFromArgo.py                 133519 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:06.10 python make_ohcFromArgo.py                 133521 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:06.15 python make_ohcFromArgo.py                 133523 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:06.19 python make_ohcFromArgo.py                 133533 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:06.31 python make_ohcFromArgo.py                 133547 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:07.13 python make_ohcFromArgo.py                 133549 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:07.15 python make_ohcFromArgo.py                 133553 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133555 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133557 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133562 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133564 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133566 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133568 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133570 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133572 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133574 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133576 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133579 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133581 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133583 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133585 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133587 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133589 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133591 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133593 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133595 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133597 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133599 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133601 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133603 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133605 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133607 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133609 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133611 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133613 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133615 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133617 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133619 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133621 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133623 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133625 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133627 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133629 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133631 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133633 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133635 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133637 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133639 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133641 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133643 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133645 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133647 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 133649 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133651 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133653 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133655 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133657 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133659 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133664 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133666 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133668 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133670 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133672 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133674 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133676 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133678 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133680 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133682 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133684 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133686 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133688 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133690 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133692 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133696 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133698 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133700 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133704 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133707 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133709 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133711 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133715 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133720 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133722 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py                 
133724 durack1   20   0 32.7g 200m  23m S  0.0  0.1   0:00.00 python make_ohcFromArgo.py

@durack1
Copy link
Member Author

durack1 commented Aug 15, 2018

@dnadeau4 this issue is better represented in #248 - so maybe this should be closed and I should reproduce the same there - I'll do that.. closing here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants