
question about the ior-hard-write result when using MPIIO+collective #68

Open
xin3liang opened this issue Apr 24, 2024 · 7 comments

xin3liang commented Apr 24, 2024

When I run ior-hard-write with the MPIIO API and collective I/O enabled (OpenMPI ROMIO):

[ior-hard-write]
# The API to be used
API = MPIIO
# Collective operation (for supported backends)
collective = TRUE
## io500.sh
io500_mpiargs="-hostfile /root/io500test/mpi-hosts --map-by node -np $np \
        -mca pml ucx \
        -mca btl ^openib \
        -mca io romio321 \
        -x ROMIO_FSTYPE_FORCE=lustre: \
         --allow-run-as-root "

With the same np=144, I find that although the running time is much shorter, the bandwidth is lower than with the POSIX API.

Result (API: MPIIO+collective):

[RESULT]       ior-hard-write        2.423454 GiB/s : time 300.345 seconds

Result (API: POSIX):

[RESULT]       ior-hard-write        3.770933 GiB/s : time 1843.400 seconds

Why is the running time better but the bandwidth worse?


xin3liang commented Apr 24, 2024

Software versions
OS: openEuler 22.03 LTS SP3, kernel 5.10.0-192.0.0.105.oe2203sp3
Lustre: 2.15.4
io500: io500-isc24_v3, master
OpenMPI: v4.1.x branch, 4.1.7a1
UCX: 1.16.0

Lustre cluster:
network: 100Gb InfiniBand
filesystem_summary: 54.9T
client_num: 9, cores_per_node: 16, np: 144
MDTs: 24, OSTs: 96
Client(s): 9 client1,client10,client11,client13,client14,client2,client3,client4,client5
Server(s): 6 server1,server2,server3,server4,server5,server6
mgs nodes: 1 server1
mdts nodes: 6 server1 server2 server3 server4 server5 server6
osts nodes: 6 server1 server2 server3 server4 server5 server6


adilger commented Apr 24, 2024

It looks like the MPIIO run wrote about 860 GiB in 355s:
Using actual aggregate bytes moved = 923312332800

And the POSIX run wrote about 7153 GiB in 1897s:
Using actual aggregate bytes moved = 7680164783616

So the POSIX run wrote 8.3x as much data in 5.35x as much time, making its bandwidth about 1.55x higher.
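(For anyone double-checking the ratios, a quick calculation using the byte counts and times quoted above:)

```python
# Byte counts and elapsed times from the two ior-hard-write runs above.
mpiio_bytes, mpiio_secs = 923312332800, 355
posix_bytes, posix_secs = 7680164783616, 1897

print(posix_bytes / mpiio_bytes)                                # ~8.3x as much data
print(posix_secs / mpiio_secs)                                  # ~5.3x as much time
print((posix_bytes / posix_secs) / (mpiio_bytes / mpiio_secs))  # ~1.55x the bandwidth
```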

I've always thought that MPIIO collective IO should be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it should be aggregating the IO into large chunks and writing them linearly to the storage.

xin3liang commented Apr 24, 2024

It looks like the MPIIO run wrote about 860 GiB in 355s: Using actual aggregate bytes moved = 923312332800

And the POSIX run wrote about 7153 GiB in 1897s: Using actual aggregate bytes moved = 7680164783616

So the POSIX run wrote 8.3x as much data in 5.35x as much time, making its bandwidth about 1.55x higher.

OK, this explains it, thanks a lot @adilger.
I thought both runs would finish the ior test and write the same amount of data (Expected aggregate file size = 67691520000000), but the ior tests can be stopped early by stonewalling ("maybe caused by deadlineForStonewalling").

I've always thought that MPIIO collective IO should be faster for the ior-hard-write phase, but for some reason it is not. That is far outside my area of expertise, so I can't speculate why it would be slower, but it should be aggregating the IO into large chunks and writing them linearly to the storage.

Yes, that matches my tests. For a small-scale cluster (1Gb TCP network, several HDD OSTs/MDTs), the ior-hard-write bandwidth ranking is POSIX < ROMIO < OMPIO. But for a large-scale cluster (100Gb IB network, dozens of NVMe OSTs/MDTs), the ranking is POSIX > ROMIO >> OMPIO.

It seems the communication overhead between MPI processes matters. Some research may explain why:
http://aturing.umcs.maine.edu/~phillip.dickens/pubs/Poster1.doc.pdf
https://phillipmdickens.github.io/pubs/paper1.pdf


JulianKunkel commented Apr 24, 2024 via email

xin3liang commented Apr 24, 2024

Thanks @JulianKunkel for the explanation, I have a clearer understanding now.
For independent I/O, after running 300s, all the processes still need to wait for the syncs to flush data to disk, right?
Because the ior test specifies the -e option (perform fsync upon POSIX write close), does the whole time depend on how long the fsyncs take to finish?
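(For context, on the POSIX backend ior's -e flag amounts to roughly the following per write/close cycle. This is a minimal sketch of the concept, not ior's code; the file name is made up, and the write size echoes the ior-hard transfer size.)

```python
import os

fd = os.open("ior_hard_file", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"\0" * 47008)   # one ior-hard-sized (47008-byte) transfer
os.fsync(fd)                  # -e: flush dirty data to storage before close
os.close(fd)
```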

The reason for the time and performance difference is that with MPI-IO the I/O is synchronized. The benchmark runs on each process independently for 300s, then the processes exchange information about how many I/Os were done, and in the final stage each process writes the same number of I/Os. This is the stonewalling feature with wear-out. With MPI-IO collectives, all processes complete the same number of I/Os in every iteration, so they finish quickly after 300s. With independent I/O, some processes might be much faster than others, leading to a long wear-out period. MPI-IO performance with collectives is unfortunately only good in some cases.
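(To make the stonewalling wear-out behaviour concrete, here is a minimal sketch of the idea, assuming mpi4py. It only illustrates the concept described above; it is not ior's actual implementation, and do_one_io is a hypothetical placeholder.)

```python
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
DEADLINE = 300.0               # seconds, the deadlineForStonewalling

def do_one_io():
    pass                       # placeholder for a single 47008-byte write

# Phase 1: every rank writes independently until the deadline expires.
start = time.time()
my_ios = 0
while time.time() - start < DEADLINE:
    do_one_io()
    my_ios += 1

# Phase 2 (wear-out): all ranks agree on the maximum I/O count reached by any
# rank, and the slower ranks keep writing until they match it. With collective
# MPI-IO the counts are already equal, so this phase is short; with independent
# I/O it can stretch the run far beyond 300 s.
max_ios = comm.allreduce(my_ios, op=MPI.MAX)
while my_ios < max_ios:
    do_one_io()
    my_ios += 1
```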



JulianKunkel commented Apr 24, 2024 via email
