We can specify the number of slots in the hostfile. Slots can be interpreted as the number of available processors on a host. If slots are not specified, the number of slots defaults to one.
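For example, with Open MPI the slot count is set per host in the hostfile (the host names here are illustrative):

```
# one line per host; "slots" caps how many processes are placed there
node1 slots=4
node2 slots=4
```

Launching with `mpirun --hostfile hostfile -np 8 ./program` would then place four processes on each of the two hosts.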
2. Which functions do you use for retrieving the rank of an MPI process and the total number of processes?
Rank of a process: MPI_Comm_rank(MPI_COMM_WORLD, &rank);
Total number of processes: MPI_Comm_size(MPI_COMM_WORLD, &size);
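A minimal complete program around those two calls (the output format is illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // this process's id in [0, size)
    MPI_Comm_size(MPI_COMM_WORLD, &size);  // total number of processes

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```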
The MPI standard requires that an MPI_Send call block until the send buffer is safe to reuse. Similarly, the standard requires that an MPI_Recv call block until the receive buffer actually contains the intended message. In short, each call does not return until its side of the communication is complete.
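A minimal sketch of this behavior (ranks and tag are illustrative; `rank` is assumed to be set as above): rank 0 sends, rank 1 receives, and each call returns only once its buffer is safe to touch.

```c
int msg = 42;
if (rank == 0) {
    // Returns only when msg may be safely reused (the data has been
    // buffered internally or delivered to the receiver).
    MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {
    int recvd;
    // Returns only when recvd actually contains the message from rank 0.
    MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```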
2. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
2. How does the performance of binary tree reduction compare to the performance of linear reduction?
The performance is nearly identical; there is only a slight difference at 8 processes.
3. Increasing the number of processes, which approach (linear/tree) is going to perform better? Why? Think about the number of messages and their costs.
I think the tree will perform better, because the linear approach requires the master process to receive a message from every other process one after another, so its cost grows as O(n). In the tree approach, the number of communication rounds on the critical path drops from O(n) to O(log n); a sketch follows below.
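A minimal sketch of the tree idea (not the assignment's actual code): each round, half of the surviving processes send their partial sum one level up and drop out, so summing over n processes takes about log2(n) rounds instead of n-1 sequential receives at the master.

```c
#include <mpi.h>

// Binary-tree sum reduction to rank 0.
int tree_reduce(int my_val, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int sum = my_val;
    for (int step = 1; step < size; step *= 2) {
        if (rank % (2 * step) != 0) {
            // Pass the partial sum to the partner one level up, then stop.
            MPI_Send(&sum, 1, MPI_INT, rank - step, 0, comm);
            break;
        } else if (rank + step < size) {
            int other;
            MPI_Recv(&other, 1, MPI_INT, rank + step, 0, comm,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    return sum;  // the complete sum is valid on rank 0 only
}
```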
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
The names of MPI non-blocking functions carry an 'I' (for "immediate") after the 'MPI_' prefix; here are some examples:
MPI_Iallgather
MPI_Iallgatherv
MPI_Iallreduce
MPI_Ialltoall
MPI_Ialltoallv
MPI_Ibarrier
MPI_Ibsend
MPI_Igather
MPI_Igatherv
MPI_Iprobe
MPI_Irecv
MPI_Ireduce
MPI_Ireduce_scatter
MPI_Ireduce_scatter_block
MPI_Irsend
MPI_Iscatter
MPI_Iscatterv
MPI_Isend
MPI_Issend
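For example, MPI_Isend and MPI_Irecv return immediately with a request handle, and completion is forced later with MPI_Wait; the gap in between is where useful computation can overlap the transfer (ranks and the message are illustrative):

```c
if (rank == 0) {
    MPI_Request req;
    int msg = 42;
    MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    /* ...do other work while the send progresses... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  // msg is reusable only after this
} else if (rank == 1) {
    MPI_Request req;
    int recvd;
    MPI_Irecv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
    /* ...do other work while the receive progresses... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  // recvd is valid only after this
}
```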
3. How does the performance of non-blocking communication compare to the performance of blocking communication?
The performance is nearly identical; there is only a slight difference at 12 processes. Theoretically, non-blocking communication has the advantage that the process can do other work while the communication completes.
1. Measure the performance (execution time) of the code for 2, 4, 8, 12, 16 MPI processes and plot it.
2. Which approach gives the best performance among the 1.2.1-1.2.6 cases? What is the reason for that?
Other than the one-sided communication, the implementations show almost no difference in performance. I think this is because the program does not rely heavily on inter-process communication; most of the time is spent on the Monte Carlo computation itself.
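A sketch of why communication is negligible here (assuming the usual Monte Carlo pi estimation; variable names are illustrative, and `rank`, `size`, and `samples_per_proc` are assumed to be set): each process spends almost all of its time in the local sampling loop and communicates exactly once, in a single reduction at the end.

```c
// Each process draws its share of samples locally ...
long local_hits = 0;
unsigned int seed = rank + 1;  // per-process seed (illustrative)
for (long i = 0; i < samples_per_proc; i++) {
    double x = rand_r(&seed) / (double)RAND_MAX;
    double y = rand_r(&seed) / (double)RAND_MAX;
    if (x * x + y * y <= 1.0) local_hits++;
}
// ... and communicates exactly once: a single reduction of one long.
long total_hits = 0;
MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("pi ~= %f\n", 4.0 * total_hits / (samples_per_proc * (double)size));
```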
- Case 1: bandwidth = (1 / 1.5594563042333037e-10) B/s ≈ 6.4e9 B/s ≈ 6.4 GB/s; latency = 0.0003616054620147061 s ≈ 0.36 ms
- Case 2: bandwidth = (1 / 8.641607908748788e-09) B/s ≈ 1.16e8 B/s ≈ 0.12 GB/s; latency = 0.0013578094881199364 s ≈ 1.36 ms
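These numbers are consistent with fitting the linear cost model t(n) ≈ latency + n × time_per_byte to ping-pong timings. A minimal two-process ping-pong sketch that produces such (n, t) pairs (message sizes and repetition count are illustrative):

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    for (int n = 1; n <= (1 << 20); n *= 2) {  // message sizes 1 B .. 1 MiB
        char *buf = malloc(n);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);  // mean one-way time
        if (rank == 0) printf("%d bytes: %.3e s\n", n, t);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```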
- Partition a_mat evenly into blocks of rows, one block per process. The remaining rows are assigned to the master process.
- Transpose b_mat to avoid cache misses: accessing both matrices row-wise improves performance through spatial locality (see the sketch below).
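A sketch of the resulting local kernel (names and layout are assumptions, not the assignment's actual code): with b_mat transposed into bT, both operands are scanned row-wise in the inner loop.

```c
// bT is b_mat transposed, stored row-major: bT[j*n + k] == b_mat[k*n + j].
// This process owns rows [row_start, row_end) of a.
void local_matmul(int n, int row_start, int row_end,
                  const double *a, const double *bT, double *c) {
    for (int i = row_start; i < row_end; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            // Row i of a and row j of bT are both traversed sequentially,
            // so the inner loop benefits from spatial locality.
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * bT[j * n + k];
            c[i * n + j] = sum;
        }
    }
}
```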










