
question about the ior-hard-write result when using MPIIO+collective #68

Open
xin3liang opened this issue Apr 24, 2024 · 7 comments

xin3liang commented Apr 24, 2024

When I run ior-hard-write with the MPIIO API and collective I/O enabled (OpenMPI ROMIO):

[ior-hard-write]
# The API to be used
API = MPIIO
# Collective operation (for supported backends)
collective = TRUE
## io500.sh
io500_mpiargs="-hostfile /root/io500test/mpi-hosts --map-by node -np $np \
        -mca pml ucx \
        -mca btl ^openib \
        -mca io romio321 \
        -x ROMIO_FSTYPE_FORCE=lustre: \
         --allow-run-as-root "

With the same np=144, I find that although the running time is much shorter, the bandwidth is lower than with the POSIX API.

Result (API: MPIIO+collective):

[RESULT]       ior-hard-write        2.423454 GiB/s : time 300.345 seconds

Result (API: POSIX):

[RESULT]       ior-hard-write        3.770933 GiB/s : time 1843.400 seconds

Why is the running time better but the bandwidth worse?


xin3liang commented Apr 24, 2024

Software versions
OS: openEuler 22.03 LTS SP3, kernel 5.10.0-192.0.0.105.oe2203sp3
Lustre: 2.15.4
io500: io500-isc24_v3, master
OpenMPI: v4.1.x branch, 4.1.7a1
UCX: 1.16.0

Lustre cluster:
network: 100Gb InfiniBand
filesystem_summary: 54.9T
client_num: 9, cores_per_node: 16, np: 144
MDTs: 24, OSTs: 96
Client(s): 9 client1,client10,client11,client13,client14,client2,client3,client4,client5
Server(s): 6 server1,server2,server3,server4,server5,server6
mgs nodes: 1 server1
mdts nodes: 6 server1 server2 server3 server4 server5 server6
osts nodes: 6 server1 server2 server3 server4 server5 server6


adilger commented Apr 24, 2024

It looks like the MPIIO run wrote about 860 GiB in 355s:
Using actual aggregate bytes moved = 923312332800

And the POSIX run wrote about 7153 GiB in 1897s:
Using actual aggregate bytes moved = 7680164783616

So the POSIX run wrote 8.3x as much data in 5.35x as much time, making its bandwidth about 1.55x higher.
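(For anyone double-checking the ratios, a quick calculation using the byte counts and times quoted above:)

```python
# Byte counts and elapsed times from the two ior-hard-write runs above.
mpiio_bytes, mpiio_secs = 923312332800, 355
posix_bytes, posix_secs = 7680164783616, 1897

print(posix_bytes / mpiio_bytes)                                # ~8.3x as much data
print(posix_secs / mpiio_secs)                                  # ~5.3x as much time
print((posix_bytes / posix_secs) / (mpiio_bytes / mpiio_secs))  # ~1.55x the bandwidth
```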

I've always thought that MPIIO collective IO should be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it should be aggregating the IO into large chunks and writing them linearly to the storage.

xin3liang commented Apr 24, 2024

It looks like the MPIIO run wrote about 860 GiB in 355s: Using actual aggregate bytes moved = 923312332800

And the POSIX run wrote about 7153 GiB in 1897s: Using actual aggregate bytes moved = 7680164783616

So the POSIX run wrote 8.3x as much data in 5.35x as much time, making its bandwidth about 1.55x higher.

OK, this explains it, thanks a lot @adilger.
I thought both runs would finish the ior test and write the same amount of data (Expected aggregate file size = 67691520000000), but the ior tests can be stopped early by stonewalling ("maybe caused by deadlineForStonewalling").

I've always thought that MPIIO collective IO should be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it should be aggregating the IO into large chunks and writing them linearly to the storage.

Yes, that matches my tests. For a small-scale cluster (1Gb TCP network, several HDD OSTs/MDTs), the ior-hard-write bandwidth ranking is POSIX < ROMIO < OMPIO. But for a large-scale cluster (100Gb IB network, dozens of NVMe OSTs/MDTs), the ranking is POSIX > ROMIO >> OMPIO.

It seems the communication overhead between MPI processes matters. Some research may explain why:
http://aturing.umcs.maine.edu/~phillip.dickens/pubs/Poster1.doc.pdf
https://phillipmdickens.github.io/pubs/paper1.pdf


JulianKunkel commented Apr 24, 2024 via email

xin3liang commented Apr 24, 2024

Thanks @JulianKunkel for the explanation, I have a clearer understanding now.
For independent I/O, after running 300s, all the processes still need to wait for the syncs to flush data to disk, right?
Because the ior test specifies the -e option (perform fsync upon POSIX write close), does the whole time depend on how long the fsyncs take to finish?
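(For context, on the POSIX backend ior's -e flag amounts to roughly the following per write/close cycle. This is a minimal sketch of the concept, not ior's code; the file name is made up, and the write size echoes the ior-hard transfer size.)

```python
import os

fd = os.open("ior_hard_file", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"\0" * 47008)   # one ior-hard-sized (47008-byte) transfer
os.fsync(fd)                  # -e: flush dirty data to storage before close
os.close(fd)
```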

The reason for the time and performance difference is that with MPI-IO the I/O is synchronized. The benchmark runs on each process independently for 300s, then the processes exchange information about how many I/Os were done, and in the final stage each process writes the same number of I/Os. This is the stonewalling feature with wear-out. With MPI-IO collectives, all processes complete the same number of I/Os in every iteration, so they finish quickly after 300s. With independent I/O, some processes might be much faster than others, leading to a long wear-out period. MPI-IO performance with collectives is unfortunately only good in some cases.
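(To make the stonewalling wear-out behaviour concrete, here is a minimal sketch of the idea, assuming mpi4py. It only illustrates the concept described above; it is not ior's actual implementation, and do_one_io is a hypothetical placeholder.)

```python
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
DEADLINE = 300.0               # seconds, the deadlineForStonewalling

def do_one_io():
    pass                       # placeholder for a single 47008-byte write

# Phase 1: every rank writes independently until the deadline expires.
start = time.time()
my_ios = 0
while time.time() - start < DEADLINE:
    do_one_io()
    my_ios += 1

# Phase 2 (wear-out): all ranks agree on the maximum I/O count reached by any
# rank, and the slower ranks keep writing until they match it. With collective
# MPI-IO the counts are already equal, so this phase is short; with independent
# I/O it can stretch the run far beyond 300 s.
max_ios = comm.allreduce(my_ios, op=MPI.MAX)
while my_ios < max_ios:
    do_one_io()
    my_ios += 1
```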



JulianKunkel commented Apr 24, 2024 via email
