ifgram_inversion: dask failure with large number of workers #518

Open
falkamelung opened this issue Feb 12, 2021 · 18 comments · Fixed by #691

@falkamelung
Contributor

falkamelung commented Feb 12, 2021

Description of the problem
I am running ifgram_inversion.py with mintpy.compute.numWorker=all on stampede2 (48 cores per node, 2 threads each), and it thinks there are 96 cores:

------- start parallel processing using Dask -------
input Dask cluster type: local
initiate Dask cluster
scale the cluster to 96 workers
initiate Dask client

This value is returned by num_core = os.cpu_count() in mintpy/objects/cluster.py. So we need another command to figure out the number of threads per core and divide by it. The lscpu command does report the proper number of cores and threads on both Stampede2 and Frontera; I would expect that there is an equivalent in Python.
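
As an illustration of the proposed division (hypothetical variable names; the value 2 is taken from the lscpu output below):

import os

threads_per_core = 2                        # "Thread(s) per core" from lscpu on Stampede2
num_threads = os.cpu_count()                # logical CPUs, i.e. 96 on Stampede2
num_core = num_threads // threads_per_core  # 48 physical cores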

Stampede:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                96
On-line CPU(s) list:   0-95
Thread(s) per core:    2
Core(s) per socket:    24
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
Stepping:              4
CPU MHz:               2100.000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              33792K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_ppin intel_pt ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke spec_ctrl intel_stibp flush_l1d

Frontera:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    1
Core(s) per socket:    28
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz
Stepping:              7
CPU MHz:               3299.853
CPU max MHz:           4000.0000
CPU min MHz:           1000.0000
BogoMIPS:              5400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              1024K
L3 cache:              39424K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities
@krasny2k5

Hi!

I'm not sure if I understood your problem correctly, so sorry if not.

Stampede's CPU, the Intel(R) Xeon(R) Platinum 8160 @ 2.10GHz, has a total of 24 cores and 48 threads per processor (you can check it here). If you have two processors, you have 96 threads in total, which is what Linux reports under /proc/cpuinfo, so dask reporting 96 threads seems to be fine.
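
For example, with the Stampede2 numbers from the lscpu output above:

sockets, cores_per_socket, threads_per_core = 2, 24, 2
logical_cpus = sockets * cores_per_socket * threads_per_core  # = 96, what os.cpu_count() and dask see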

@falkamelung
Contributor Author

Yes, it reports 96 processors. I am not sure whether it should report this number, but ifgram_inversion.py fails. In contrast, if I set mintpy.compute.numWorker=48 it works fine. I forgot to mention this, sorry.

@Ovec8hkin
Contributor

I'm confused. lscpu says there are 96 CPUs, which is what cpu_count() finds. That's what it is supposed to be. Dask doesn't operate on "cores", it operates on "CPUs", which are different things. There is no easy way to detect the number of cores, as you defined them, at runtime.

@Ovec8hkin
Contributor

@falkamelung It'll be much simpler for you to just set mintpy.compute.numWorker to 48 and not touch it anymore. Additional searches have not turned up anything that will pull just the number of cores instead of the CPU count.

@falkamelung
Contributor Author

Hi @Ovec8hkin, are you saying you did not find a command to retrieve the 'Thread(s) per core' value from the system? As I said, num_core = os.cpu_count() / num_cpu_threads would work. The issue is that we use the same MintPy installation on different queues with different hardware, i.e. on nodes with different numbers of cores and threads.

We could just run os.system('lscpu'), but something pythonic would be preferred. If there is no immediate solution, let me first try a few more systems to see whether os.system('lscpu') would indeed solve the problem.
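
Something along these lines might work (a rough sketch only, Linux-specific; it uses subprocess instead of os.system so the output can be captured, and the helper name is just for illustration):

import os
import subprocess

def get_num_physical_cores():
    # divide the logical CPU count by the 'Thread(s) per core' value reported by lscpu (Linux only)
    threads_per_core = 1
    try:
        out = subprocess.run(['lscpu'], capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if line.startswith('Thread(s) per core:'):
                threads_per_core = int(line.split(':')[1])
                break
    except (OSError, subprocess.CalledProcessError, ValueError):
        pass  # fall back to counting every logical CPU as a core
    return os.cpu_count() // threads_per_core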

@Ovec8hkin
Contributor

There is no way to get the values you want in pure python. You can get either the total number of CPUs (96) via psutil.cpu_count() or the number of cores per socket (24) via psutil.cpu_count(logical=False).
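
For reference, a quick way to compare the two counts (assuming psutil is installed):

import psutil

print('logical CPUs  :', psutil.cpu_count())               # hardware threads
print('physical cores:', psutil.cpu_count(logical=False))  # hardware threads excluded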

@Ovec8hkin
Contributor

Parsing 'lscpu' via os should not be done, as it is resource intensive.

@Ovec8hkin
Contributor

@yunjunz Thoughts on this? I would really hesitate to parse the output of lscpu in Python looking for the proper values we want, as it's a pretty limited use case, and I can't find any other means by which to get the number of cores Falk claims are available.

@falkamelung
Contributor Author

Are you sure the os.system('lscpu') call is a problem? It would be run only once each in ifgram_inversion and dem_error.

@yunjunz
Member

yunjunz commented Feb 22, 2021

Using lscpu sounds fine to me, since it only affects the translation of all. Of course, we should explain it well in the comments of the function in cluster.py and next to the option in the smallbaselineApp.cfg file.

@Ovec8hkin
Contributor

Using os.system('lscpu') is a messy workaround for this, and doing things like that is generally not recommended by the Python community, as it is expensive compared to a native Python call or a full bash script. If @yunjunz doesn't mind, I guess you can go ahead and implement it.

Send me a code block that pulls the correct data from the lscpu output, and I will add it in the right spot and document it.

@yunjunz
Member

yunjunz commented Nov 9, 2021

Hi @falkamelung and @Ovec8hkin, I have implemented num_core = os.cpu_count() / num_cpu_threads as the translation for numWorker = all, as Falk suggested. Cheers.

@falkamelung
Contributor Author

Thank you. Just to let you know, dask still has a problem. It normally does not work for me to use numWorker=all or numWorker=48 on stampede; I get weird dask connection errors. I normally use numWorker=36, which is a good compromise between failure frequency and speed. It would be good to find out why it fails when too many workers are given. I can provide examples.

@hfattahi
Collaborator

hfattahi commented Nov 9, 2021

@falkamelung Could it be that you run out of memory when you have many workers? Any idea how much memory you have on that machine and how much mintpy is using with 36 or 48 workers?

@falkamelung
Contributor Author

I don't think it is a memory problem. Here is the error message I get when I run a small dataset with numWorker=48; the same dataset works fine with numWorker=6. Below that is the output from the free command.

cat smallbaseline_wrapper_8740432.e
ls: cannot access mintpy/timeseries*: No such file or directory
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ab8e67b5970>>, <Task finished name='Task-947' coro=<Cluster._sync_cluster_info() done, defined at /work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py:104> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s')>)
Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/tmp/rsmas_insar/3rdparty/miniconda3/lib/python3.8/asyncio/tasks.py", line 464, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 741, in _run_callback
    ret = callback()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
    future.result()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 105, in _sync_cluster_info
    await self.scheduler_comm.set_metadata(
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/core.py", line 785, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/core.py", line 742, in live_comm
    comm = await connect(
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x2ab8e67b5970>>, <Task finished name='Task-952' coro=<Cluster._sync_cluster_info() done, defined at /work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/deploy/cluster.py:104> exception=OSError('Timed out during handshake while connecting to tcp://127.0.0.1:43192 after 30 s')>)
Traceback (most recent call last):
  File "/work2/05861/tg851601/stampede2/code/rsmas_insar/3rdparty/miniconda3/lib/python3.8/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/tmp/rsmas_insar/3rdparty/miniconda3/lib/python3.8/asyncio/tasks.py", line 464, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

 free -h
              total        used        free      shared  buff/cache   available
Mem:           187G         10G        164G        349M         12G        172G
Swap:            0B          0B          0B

@EJFielding

EJFielding commented Nov 10, 2021

The Linux server I was using recently has 16 cores and two threads per core. It ran successfully with 32 parallel dask processes. It was using files on a ZFS file system that is directly connected. I did not check how much memory was used, but the machine has 128 GB of RAM.

@yunjunz
Member

yunjunz commented Nov 12, 2021

Thank you Falk and Eric for the info. This means we still have not located the cause. The "conflict during multiple HDF5 writing processes" described in #692 still sounds like a reasonable guess to me.

Update: multiple HDF5 writing processes do not exist in MintPy, so that is not the cause.

We may revert the num_core = os.cpu_count() / num_cpu_threads translation in the future once the issue is cleared.

@yunjunz changed the title from "ifgram_inversion.py (dask) does not properly recognize number of available cores" to "ifgram_inversion: dask failure with large number of workers" on Nov 12, 2021
@yunjunz yunjunz added the bug label Nov 12, 2021
@yunjunz yunjunz reopened this Nov 12, 2021
@yunjunz yunjunz linked a pull request Nov 13, 2021 that will close this issue
@yunjunz
Member

yunjunz commented Nov 13, 2021

All the other errors, e.g. no access to file and negative dimension size, have been fixed by the PR above. The remaining recurring error is OSError: Timed out during handshake while connecting to tcp from dask. A quick search and a few attempts did not fix this issue completely on stampede.

To better handle this situation, I added support for mintpy.compute.numWorker = 75% style input in the linked PR above.
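
For illustration, the translation of such an input could look roughly like this (a sketch only, not the actual MintPy code; the function name is hypothetical):

import os

def translate_num_worker(spec, num_core=None):
    # turn 'all', '75%' or '8' into an integer number of dask workers (illustrative only)
    num_core = num_core or os.cpu_count()
    spec = str(spec).strip().lower()
    if spec == 'all':
        return num_core
    if spec.endswith('%'):
        return max(1, int(num_core * float(spec[:-1]) / 100))
    return min(int(spec), num_core)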
