ifgram_inversion: dask failure with large number of workers #518

Open
falkamelung opened this issue Feb 12, 2021 · 18 comments · Fixed by #691

@falkamelung
Contributor

falkamelung commented Feb 12, 2021

Description of the problem
I am running ifgram_inversion.py with mintpy.compute.numWorker=all on stampede2 (48 cores per node, 2 threads each), and it thinks there are 96 cores:

------- start parallel processing using Dask -------
input Dask cluster type: local
initiate Dask cluster
scale the cluster to 96 workers
initiate Dask client

This value is returned by num_core = os.cpu_count() in mintpy/objects/cluster.py. So we need another command to figure out the number of threads per core and divide by it. The lscpu command does report the proper number of cores and threads on both Stampede2 and Frontera; I would expect that there is an equivalent in Python.
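
As an illustration of the proposed division (hypothetical variable names; the value 2 is taken from the lscpu output below):

import os

threads_per_core = 2                        # "Thread(s) per core" from lscpu on Stampede2
num_threads = os.cpu_count()                # logical CPUs, i.e. 96 on Stampede2
num_core = num_threads // threads_per_core  # 48 physical cores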

Stampede:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2100.000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

Frontera:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    1
Core(s) per socket:    28
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
Stepping:              7
CPU MHz:               3299.853
CPU max MHz:           4000.0000
CPU min MHz:           1000.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              39424K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
@krasny2k5

Hi!

I'm not sure if I understood your problem correctly, so sorry if not.

Stampede's CPU, the Intel(R) Xeon(R) Platinum 8160 @ 2.10GHz, has a total of 24 cores and 48 threads per processor (you can check it here). If you have two processors, you have 96 threads in total, which is what Linux reports under /proc/cpuinfo, so dask reporting 96 threads seems to be fine.
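
For example, with the Stampede2 numbers from the lscpu output above:

sockets, cores_per_socket, threads_per_core = 2, 24, 2
logical_cpus = sockets * cores_per_socket * threads_per_core  # = 96, what os.cpu_count() and dask see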

@falkamelung
Contributor Author

Yes, it reports 96 processors. I am not sure whether it should report this number, but ifgram_inversion.py fails. In contrast, if I set mintpy.compute.numWorker=48 it works fine. I forgot to mention this, sorry.

@Ovec8hkin
Contributor

I'm confused. lscpu says there are 96 CPUs, which is what cpu_count() finds. That's what it is supposed to be. Dask doesn't operate on "cores", it operates on "CPUs", which are different things. There is no easy way to detect the number of cores, as you defined them, at runtime.

@Ovec8hkin
Contributor

@falkamelung It'll be much simpler for you to just set mintpy.compute.numWorker to 48 and not touch it anymore. Additional searches have not turned up anything that will pull just the number of cores instead of the CPU count.

@falkamelung
Contributor Author

Hi @Ovec8hkin, are you saying you did not find a command to retrieve the 'Thread(s) per core' value from the system? As I said, num_core = os.cpu_count() / num_cpu_threads would work. The issue is that we use the same MintPy installation on different queues with different hardware, i.e. on nodes with different numbers of cores and threads.

We could just run os.system('lscpu'), but something pythonic would be preferred. If there is no immediate solution, let me first try a few more systems to see whether os.system('lscpu') would indeed solve the problem.
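
Something along these lines might work (a rough sketch only, Linux-specific; it uses subprocess instead of os.system so the output can be captured, and the helper name is just for illustration):

import os
import subprocess

def get_num_physical_cores():
    # divide the logical CPU count by the 'Thread(s) per core' value reported by lscpu (Linux only)
    threads_per_core = 1
    try:
        out = subprocess.run(['lscpu'], capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if line.startswith('Thread(s) per core:'):
                threads_per_core = int(line.split(':')[1])
                break
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass  # fall back to counting every logical CPU as a core
    return os.cpu_count() // threads_per_core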

@Ovec8hkin
Contributor

There is no way to get the values you want in pure python. You can get either the total number of CPUs (96) via psutil.cpu_count() or the number of cores per socket (24) via psutil.cpu_count(logical=False).
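
For reference, a quick way to compare the two counts (assuming psutil is installed):

import psutil

print('logical CPUs  :', psutil.cpu_count())               # hardware threads
print('physical cores:', psutil.cpu_count(logical=False))  # hardware threads excluded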

@Ovec8hkin
Contributor

Parsing 'lscpu' via os should not be done, as it is resource intensive.

@Ovec8hkin
Contributor

@yunjunz Thoughts on this? I would really hesitate to parse the output of lscpu in Python looking for the proper values we want, as it's a pretty limited use case, and I can't find any other means by which to get the number of cores Falk claims are available.

@falkamelung
Contributor Author

Are you sure the os.system('lscpu') call is a problem? It would be run only once each in ifgram_inversion and dem_error.

@yunjunz
Member

yunjunz commented Feb 22, 2021

Using lscpu sounds fine to me, since it only affects the translation of all. Of course, we should explain it well in the comments of the function in cluster.py and next to the option in the smallbaselineApp.cfg file.

@Ovec8hkin
Contributor

Using os.system('lscpu') is a messy workaround for this, and doing things like that is generally not recommended by the Python community, as it is expensive compared to a native Python call or a full bash script. If @yunjunz doesn't mind, I guess you can go ahead and implement it.

Send me a code block that pulls the correct data from the lscpu output, and I will add it in the right spot and document it.

@yunjunz
Member

yunjunz commented Nov 9, 2021

Hi @falkamelung and @Ovec8hkin, I have implemented num_core = os.cpu_count() / num_cpu_threads as the translation for numWorker = all, as Falk suggested. Cheers.

@falkamelung
Contributor Author

Thank you. Just to let you know, dask still has a problem. It normally does not work for me to use numWorker=all or numWorker=48 on stampede; I get weird dask connection errors. I normally use numWorker=36, which is a good compromise between failure frequency and speed. It would be good to find out why it fails when too many workers are given. I can provide examples.

@hfattahi
Collaborator

hfattahi commented Nov 9, 2021

@falkamelung Could it be that you run out of memory when you have many workers? Any idea how much memory you have on that machine and how much mintpy is using with 36 or 48 workers?

@falkamelung
Contributor Author

I don't think it is a memory problem. Here is the error message I get when I run a small dataset with numWorker=48; the same dataset works fine with numWorker=6. Below that is the output from the free command.

cat smallbaseline_wrapper_8740432.e
ls: cannot access mintpy/timeseries*: No such file or directory
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ab8e67b5970>>, <Task finished name='Task-947' coro=<Cluster._sync_cluster_info() done, defined at /work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py:104> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s')>)
Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/tmp/rsmas_insar/3rdparty/miniconda3/lib/python3.8/asyncio/tasks.py", line 464, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 105, in _sync_cluster_info
    await self.scheduler_comm.set_metadata(
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/core.py", line 785, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/core.py", line 742, in live_comm
    comm = await connect(
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ab8e67b5970>>, <Task finished name='Task-952' coro=<Cluster._sync_cluster_info() done, defined at /work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py:104> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s')>)
Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/tmp/rsmas_insar/3rdparty/miniconda3/lib/python3.8/asyncio/tasks.py", line 464, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

 free -h
              total        used        free      shared  buff/cache   available
Mem:           187G         10G        164G        349M         12G        172G
Swap:            0B          0B          0B

@EJFielding

EJFielding commented Nov 10, 2021

The Linux server I was using recently has 16 cores and two threads per core. It ran successfully with 32 parallel dask processes. It was using files on a ZFS file system that is directly connected. I did not check how much memory was used, but the machine has 128 GB of RAM.

@yunjunz
Member

yunjunz commented Nov 12, 2021

Thank you Falk and Eric for the info. This means we still have not located the cause. The "conflict during multiple HDF5 writing processes" described in #692 still sounds like a reasonable guess to me.

Update: multiple HDF5 writing processes do not exist in MintPy, so that is not the cause.

We may revert the num_core = os.cpu_count() / num_cpu_threads translation in the future once the issue is cleared.

@yunjunz changed the title from "ifgram_inversion.py (dask) does not properly recognize number of available cores" to "ifgram_inversion: dask failure with large number of workers" on Nov 12, 2021
@yunjunz yunjunz added the bug label Nov 12, 2021
@yunjunz yunjunz reopened this Nov 12, 2021
@yunjunz yunjunz linked a pull request Nov 13, 2021 that will close this issue
@yunjunz
Member

yunjunz commented Nov 13, 2021

All the other errors, e.g. no access to file and negative dimension size, have been fixed by the PR above. The remaining recurring error is OSError: Timed out during handshake while connecting to tcp from dask. A quick search and a few attempts did not fix this issue completely on stampede.

To better handle this situation, I added support for mintpy.compute.numWorker = 75% style input in the linked PR above.
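
For illustration, the translation of such an input could look roughly like this (a sketch only, not the actual MintPy code; the function name is hypothetical):

import os

def translate_num_worker(spec, num_core=None):
    # turn 'all', '75%' or '8' into an integer number of dask workers (illustrative only)
    num_core = num_core or os.cpu_count()
    spec = str(spec).strip().lower()
    if spec == 'all':
        return num_core
    if spec.endswith('%'):
        return max(1, int(num_core * float(spec[:-1]) / 100))
    return min(int(spec), num_core)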
