e3sm_diags takes long time on cori-knl #314

Closed
tangq opened this issue Jul 15, 2020 · 6 comments

Comments

@tangq

tangq commented Jul 15, 2020

The e3sm_diags jobs created as parts of the post-processing bundling tool take forever to complete on cori.

One job ran out of time (2 hours). The other ran for 4+ hours without completing, so I killed it. The same job takes less than 30 minutes on compy.

The log shows something like: `OpenBLAS blas_thread_init: pthread_create failed for thread 108 of 128: Resource temporarily unavailable`

The script, output, and log files are at /global/cscratch1/sd/tang30/E3SM_analysis/20200701.v1like.f2010.northamericax4v1pg2_r0125_northamericax4v1pg2.cori-knl/post/scripts

@zshaheen
Contributor

> The log shows something like: `OpenBLAS blas_thread_init: pthread_create failed for thread 108 of 128: Resource temporarily unavailable`

When I used to create the environment YAML files, we'd sometimes have dumps with the OpenBLAS versions of some deps. I'm not sure if that was needed, since as of 3cb9ef3 it seems not to be used. I'm not sure whether the unified environment uses OpenBLAS, however.
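For readers who hit the same OpenBLAS message, here is a minimal sketch of one commonly used workaround: capping the BLAS thread pool through environment variables before NumPy is first imported. Nobody in this thread confirms that this resolves the slowdown. The helper name `cap_blas_threads` is hypothetical, and including the OpenMP and MKL variables is an assumption in case a different backend is in use.

```python
import os

# Environment variables read by common threaded math backends at load time.
# OPENBLAS_NUM_THREADS is the one relevant to the OpenBLAS message; the
# other two are included on the assumption that another backend may be used.
THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS", "MKL_NUM_THREADS")


def cap_blas_threads(n_threads=1):
    """Set the thread-count variables (hypothetical helper).

    This only has an effect if it runs before numpy (or anything else that
    loads the BLAS library) is imported in the same process.
    """
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


settings = cap_blas_threads(1)
# Import numpy only after this point so the BLAS library sees the cap.
```

The same effect can be had by exporting the variables in the job script before Python starts, which also covers worker processes that inherit the environment.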

@beharrop

@tangq were you using KNL or Haswell nodes? In my own experience, e3sm_diags runs perfectly on Haswell nodes and performs miserably on KNL nodes. I assumed it was related to some module or environment difference. I haven't tried e3sm_diags on KNL in a while though, so maybe the fix @zshaheen pointed out fixed the issue I saw.

@tangq
Author

tangq commented Jul 15, 2020

I used cori-knl nodes, which worked fine before.

@chengzhuzhang
Contributor

chengzhuzhang commented Sep 4, 2020

It might be related:
When using 30 workers on cori-haswell, the diagnostics completed within 12 min. When using 32 workers (all workers on a Haswell node) on cori-haswell, it took about 2 hrs to complete, with `Resource temporarily unavailable` showing up.

The situation is worse on KNL: the same run didn't complete within 2 hrs. Not sure what caused the issue. It started occurring after a Cori system update in July.
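One possible reading of these timings, not confirmed anywhere in this thread, is thread oversubscription: if every worker process starts its own full BLAS thread pool, the total thread count grows as workers times threads per worker. The sketch below only illustrates that arithmetic. The function names are hypothetical, and the numbers (32 workers, 128-thread pools, 64 visible cores) are illustrative assumptions rather than measurements from Cori.

```python
import os


def total_threads(num_workers, threads_per_worker):
    """Worst-case thread count if every worker starts a full BLAS pool."""
    return num_workers * threads_per_worker


def blas_threads_per_worker(num_workers, cores=None):
    """Split the visible cores evenly across workers, with a floor of one."""
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores // num_workers)


# Illustrative numbers only: 32 workers, each with a 128-thread pool.
print(total_threads(32, 128))

# A per-worker cap that would keep the total at or below 64 cores.
print(blas_threads_per_worker(32, cores=64))
```

If this is the cause, running slightly fewer workers or capping threads per worker would both reduce the total, which would be consistent with the 30-worker run finishing quickly.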

@tangq
Author

tangq commented Sep 8, 2020

July sounds like about the time I first noticed this issue.

@xylar
Contributor

xylar commented Sep 8, 2020

I'm wondering if this has nothing to do with e3sm_diags specifically. MPAS-Analysis has also been terribly slow sometimes, or even just hangs without doing anything. I have another Python code that uses xarray and dask, and it sometimes saw similar behavior. I think there's something weird with the Cori file system, but I can't pinpoint it well enough to submit a ticket about it. Maybe unrelated to this issue, too...

@chengzhuzhang changed the title from "e3sm_diags takes long time on cori" to "e3sm_diags takes long time on cori-knl" on Feb 5, 2021