# Exercise 2
This exercise consists of running an MPI microbenchmark in order to examine the impact of HPC topologies on performance.
### Description
The OSU Micro-Benchmarks suite holds multiple benchmarks that measure low-level performance properties such as latency and bandwidth between MPI ranks. Specifically, for this exercise, we are interested in the _point-to-point_ ones, which exchange messages between 2 MPI ranks.
### Tasks
#### Download and build the OSU Micro-Benchmarks 
available at http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.2.tar.gz. You can also use available binaries on LCC2 at `/scratch/c703429/osu-benchmark/libexec/osu-micro-benchmarks/mpi/pt2pt` (built with `openmpi/4.0.1`). Note: If you build yourself, do not forget to set the compiler parameters for `configure`, e.g. `./configure CC=mpicc CXX=mpic++ ...`

According to the README there are also CUDA options, but we should not need them here. I downloaded and built the OSU Micro-Benchmarks using the following bash script.
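```bash
mkdir benchmark
cd benchmark
wget http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.2.tar.gz
tar -zxvf osu-micro-benchmarks-5.6.2.tar.gz
# configure and make have to run inside the extracted source directory
cd osu-micro-benchmarks-5.6.2
# load the MPI toolchain so that mpicc and mpic++ are on the PATH
module load openmpi/4.0.1
./configure CC=mpicc CXX=mpic++
make
```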
#### After building, submit SGE jobs that run the `osu_latency` and `osu_bw` executables.
I am using the scripts `osu_bw.job` and `osu_latency.job`, which can be found in the `benchmark` directory. Both request 2 slots per node with 2 slots in total, so both ranks run on the same node. To be fair, I find the `Xperhost Y` syntax very confusing, so bear with me.
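
My reading of the `Xperhost Y` syntax, written out as a small sketch: `X` is the number of slots placed on each host and `Y` is the total number of slots requested, so a job spans `Y / X` hosts. The function name `hosts_needed` is made up for illustration, and the interpretation itself is my assumption about how the parallel environments are set up, not something taken from the SGE documentation.

```python
def hosts_needed(slots_per_host: int, total_slots: int) -> int:
    """Number of hosts a job spans if every host contributes `slots_per_host` slots.

    Assumption: `Xperhost Y` means X slots per host and Y slots in total.
    """
    if slots_per_host <= 0 or total_slots <= 0:
        raise ValueError("slot counts must be positive")
    if total_slots % slots_per_host != 0:
        raise ValueError("total slots must be a multiple of the slots per host")
    return total_slots // slots_per_host


# 2 slots per host, 2 slots in total: both ranks share one node.
print(hosts_needed(2, 2))
# 1 slot per host, 2 slots in total: the two ranks land on different nodes.
print(hosts_needed(1, 2))
```

Under this reading, switching from two slots per host to one slot per host is what moves the two ranks onto different nodes.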
#### Create a table and figures that illustrate the measured data and study them. What effects can you observe?
The tables and figures are stored as CSV files and PNGs in the `benchmark` directory and are also shown below.
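
Before the data can be written to CSV, the raw `.out` files have to be turned into rows. The following is a minimal sketch of one way to do that with the standard library only. It assumes the output consists of `#` header lines followed by two whitespace-separated columns. The function name `parse_osu_text` and the sample text are hypothetical, and the numbers in the sample are placeholders, not measurements. Unlike the notebook's own parser further down, it skips header lines by their `#` prefix instead of by a fixed line count.

```python
def parse_osu_text(text: str) -> list[tuple[int, float]]:
    """Return (message size in bytes, measured value) pairs from OSU-style output."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue  # not a size/value pair
        try:
            rows.append((int(fields[0]), float(fields[1])))
        except ValueError:
            continue  # e.g. scheduler or MPI messages mixed into the output
    return rows


# Placeholder sample in the assumed shape of an osu_bw output file.
sample = """\
# OSU MPI Bandwidth Test v5.6.2
# Size      Bandwidth (MB/s)
1                       1.00
2                       2.00
4                       4.00
"""
print(parse_osu_text(sample))
```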


In [107]:
import plotly.express as px
import plotly.io as pio
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from IPython.display import display_html

def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

In [108]:
DIFF_CORES_SA_SO = "benchmark/diff_cores_same_socket"
DIFF_NODES = "benchmark/diff_nodes"
DIFF_SOCKET_SA_NO = "benchmark/diff_socket_same_node"

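def parse_benchmark_output(filepath, xname, yname) -> pd.DataFrame:
    with open(filepath) as infile:
        # Skip the three header lines, then split each non-empty row into its two fields.
        rows = [line.split() for line in infile.readlines()[3:] if line.strip()]
    bm_df = pd.DataFrame(rows, columns=[xname, yname])
    # The fields are read as strings; convert them so that both columns are numeric.
    bm_df[xname] = pd.to_numeric(bm_df[xname])
    bm_df[yname] = pd.to_numeric(bm_df[yname])
    return bm_df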

In [109]:
osu_bw_df = parse_benchmark_output("benchmark/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_bw_df.to_csv('benchmark/osu_bw.out.csv', sep='\t', encoding='utf-8')
fig = px.line(osu_bw_df, x='Size', y='Bandwidth(MB/s)', title='OSU Bandwidth (2 slots)')
fig.show()
# This will only work with plotly orca installed
pio.write_image(fig, "benchmark/osu_bw_linechart.png")

This figure shows the results of the `osu_bw` benchmark. For very small messages we achieve around 500 to 1000 MB/s. The bandwidth then increases until a message size of 262144 bytes is reached. After a short dip at around 0.5 MB, it rises to over 2500 MB/s. Only beyond 2 MB does the bandwidth decrease again; at 4 MB we achieve a little under 1000 MB/s.
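
Features like the peak and the dip can also be read off programmatically instead of by eye. This is only a sketch: the helper names `peak` and `dips` are made up, and the rows are placeholder values chosen to show the mechanics, not my measurements.

```python
def peak(rows):
    """Return the (size, bandwidth) pair with the highest bandwidth."""
    return max(rows, key=lambda row: row[1])


def dips(rows):
    """Return the sizes at which bandwidth is lower than at the previous size."""
    return [cur[0] for prev, cur in zip(rows, rows[1:]) if cur[1] < prev[1]]


# Placeholder (size in bytes, bandwidth in MB/s) pairs, sorted by size.
rows = [(1024, 100.0), (2048, 180.0), (4096, 150.0), (8192, 300.0), (16384, 250.0)]

print(peak(rows))
print(dips(rows))
```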

In [110]:
osu_lat_df = parse_benchmark_output("benchmark/osu_latency.out", 'Size', 'Latency(us)')
osu_lat_df.to_csv('benchmark/osu_latency.out.csv', sep='\t', encoding='utf-8')
fig = px.line(osu_lat_df, x='Size', y='Latency(us)', title='OSU Latency (2 slots)')
fig.show()
# This will only work with plotly orca installed
pio.write_image(fig, "benchmark/osu_lat_linechart.png")

The latency increases continuously with the message size until the message is larger than 2 MB. The maximum latency is over 3500 microseconds.

As tables these would look like this:

In [111]:
display_side_by_side(osu_bw_df.set_index(['Size', 'Bandwidth(MB/s)']), osu_lat_df.set_index(['Size', 'Latency(us)']))

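| Size | Bandwidth(MB/s) |
| --- | --- |
| 1 | 4.3 |
| 2 | 8.67 |
| 4 | 1.14 |
| 8 | 34.36 |
| 16 | 60.3 |
| 32 | 120.74 |
| 64 | 210.18 |
| 128 | 356.92 |
| 256 | 602.7 |
| 512 | 187.82 |

| Size | Latency(us) |
| --- | --- |
| 0 | 0.4 |
| 1 | 2.24 |
| 2 | 9.39 |
| 4 | 1.24 |
| 8 | 1.44 |
| 16 | 3.56 |
| 32 | 4.36 |
| 64 | 5.06 |
| 128 | 4.18 |
| 256 | 1.32 |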


## Modify your experiment such that the 2 MPI ranks are placed on
### Different cores of the same socket

In [112]:
osu_bw_df = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_lat_df = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_latency.out", 'Size', 'Latency(us)')
osu_bw_df.to_csv(DIFF_CORES_SA_SO+'/osu_bw.out.csv', sep='\t', encoding='utf-8')
osu_lat_df.to_csv(DIFF_CORES_SA_SO+'/osu_latency.out.csv', sep='\t', encoding='utf-8')
fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df['Size'], y=osu_bw_df['Bandwidth(MB/s)'], mode='markers+lines'), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_lat_df['Size'], y=osu_lat_df['Latency(us)'], mode='markers+lines'), row=1, col=2)
fig.update_layout(showlegend=False, title_text="Different cores of the same socket")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)
fig.show()
pio.write_image(fig, DIFF_CORES_SA_SO+"/linechart.png")

When running the benchmarks on two cores of the same CPU, we would expect both high bandwidth and low latency. Both expectations seem to hold: the bandwidth more than doubles compared to the first experiment and only falls once the message size goes above 4 MB. Latency is only slightly lower than in the first experiment, although it is much better at the 2 MB mark.
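
A claim such as "more than doubles" can be checked per message size by dividing the two result sets. The sketch below assumes each experiment is available as a mapping from message size to bandwidth. The name `speedup` and the numbers are placeholders for illustration, not my measured values.

```python
def speedup(baseline: dict[int, float], other: dict[int, float]) -> dict[int, float]:
    """Per-size ratio other/baseline for the sizes present in both result sets."""
    common_sizes = sorted(baseline.keys() & other.keys())
    return {
        size: other[size] / baseline[size]
        for size in common_sizes
        if baseline[size] > 0  # avoid dividing by zero
    }


# Placeholder bandwidths in MB/s, keyed by message size in bytes.
first_experiment = {1024: 100.0, 2048: 200.0}
same_socket = {1024: 250.0, 2048: 400.0, 4096: 800.0}

print(speedup(first_experiment, same_socket))
```

A ratio above 2 for a given size means the bandwidth more than doubled at that size.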

### Different nodes

In [113]:
osu_bw_df = parse_benchmark_output(DIFF_NODES+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_lat_df = parse_benchmark_output(DIFF_NODES+"/osu_latency.out", 'Size', 'Latency(us)')
osu_bw_df.to_csv(DIFF_NODES+'/osu_bw.out.csv', sep='\t', encoding='utf-8')
osu_lat_df.to_csv(DIFF_NODES+'/osu_latency.out.csv', sep='\t', encoding='utf-8')

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df['Size'], y=osu_bw_df['Bandwidth(MB/s)'], mode='markers+lines'), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_lat_df['Size'], y=osu_lat_df['Latency(us)'], mode='markers+lines'), row=1, col=2)
fig.update_layout(showlegend=False, title_text="Different nodes")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)
fig.show()
pio.write_image(fig, DIFF_NODES+"/linechart.png")

The bandwidth between two nodes was not expected to be very high. It appears to be capped at around 1500 MB/s, and this limit is reached very quickly, at around 130000 bytes.
Latency, on the other hand, grows linearly with the message size.
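
A linear latency curve and a bandwidth cap fit together if we assume a simple model of the form T(n) = α + n/β, where α is the start-up latency and β the asymptotic bandwidth (often called the Hockney model). The sketch below fits such a line with ordinary least squares. The model choice, the name `fit_line` and the synthetic data points are my assumptions; the data is generated from the model itself and is not measured.

```python
def fit_line(sizes, latencies):
    """Least-squares fit of latency = intercept + slope * size.

    Returns (intercept, slope); with sizes in bytes and latencies in
    microseconds, 1 / slope is a bandwidth in bytes/us, i.e. MB/s (1 MB = 10**6 B).
    """
    n = len(sizes)
    if n < 2 or n != len(latencies):
        raise ValueError("need at least two (size, latency) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(latencies) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, latencies))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic points generated from alpha = 2 us and beta = 1000 MB/s.
sizes = [1000, 2000, 4000, 8000]
latencies = [2.0 + size / 1000 for size in sizes]

alpha, slope = fit_line(sizes, latencies)
print(f"start-up latency: {alpha:.2f} us, asymptotic bandwidth: {1 / slope:.0f} MB/s")
```

If the fitted bandwidth for the real measurements came out close to the observed cap, that would support reading the plateau as the limit of the interconnect.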

### Different sockets of the same node

In [114]:
osu_bw_df = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_lat_df = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_latency.out", 'Size', 'Latency(us)')
osu_bw_df.to_csv(DIFF_SOCKET_SA_NO+'/osu_bw.out.csv', sep='\t', encoding='utf-8')
osu_lat_df.to_csv(DIFF_SOCKET_SA_NO+'/osu_latency.out.csv', sep='\t', encoding='utf-8')

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df['Size'], y=osu_bw_df['Bandwidth(MB/s)'], mode='markers+lines'), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_lat_df['Size'], y=osu_lat_df['Latency(us)'], mode='markers+lines'), row=1, col=2)
fig.update_layout(showlegend=False, title_text="Different sockets of the same node")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)
fig.show()
pio.write_image(fig, DIFF_SOCKET_SA_NO+"/linechart.png")

Here we reach peaks similar to the bandwidth measured for different cores of the same socket. Interestingly, the maximum bandwidth is higher than for communication between different nodes, but the bandwidth is lower for message sizes up to 2 MB.
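
The message size at which one placement overtakes another can be located with a small helper. Again a sketch under assumptions: `first_size_where_faster` is a made-up name and the dictionaries hold placeholder numbers rather than my results.

```python
def first_size_where_faster(a: dict[int, float], b: dict[int, float]):
    """Smallest common message size at which `a` has higher bandwidth than `b`.

    Returns None if `a` is never faster on the sizes both experiments share.
    """
    for size in sorted(a.keys() & b.keys()):
        if a[size] > b[size]:
            return size
    return None


# Placeholder bandwidths in MB/s, keyed by message size in bytes.
diff_sockets = {1: 1.0, 2: 2.0, 4: 9.0}
diff_nodes = {1: 2.0, 2: 3.0, 4: 5.0}

print(first_size_where_faster(diff_sockets, diff_nodes))
```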

## What happens if we run it multiple times?
In these experiments, there seem to be one or more runs whose results differ strongly from the others. One reason could be that I submitted all jobs at once, so they had to compete for the same resources. This would explain why the benchmark that ran on the same node but on different sockets was the most consistent in the test.
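
To put a number on "consistent", one option is the coefficient of variation (standard deviation divided by mean) per message size across the repeated runs. This is a sketch of that idea, not how the plots below were produced. The name `variability` and the run data are hypothetical placeholders.

```python
from statistics import mean, stdev


def variability(runs):
    """Map each size present in every run to (mean, coefficient of variation)."""
    if not runs:
        raise ValueError("need at least one run")
    # Only sizes measured in all runs can be compared.
    common_sizes = set(runs[0]).intersection(*runs[1:])
    result = {}
    for size in sorted(common_sizes):
        values = [run[size] for run in runs]
        avg = mean(values)
        cv = stdev(values) / avg if len(values) > 1 and avg != 0 else 0.0
        result[size] = (avg, cv)
    return result


# Placeholder bandwidths in MB/s for three repeated runs.
runs = [
    {1024: 100.0, 2048: 200.0},
    {1024: 100.0, 2048: 220.0},
    {1024: 100.0, 2048: 180.0},
]

for size, (avg, cv) in variability(runs).items():
    print(f"{size:>6} B  mean={avg:7.1f} MB/s  cv={cv:.1%}")
```

A low coefficient of variation for every size would indicate a placement whose runs agree with each other; a single outlier run shows up as a large value.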

In [115]:
osu_bw_df_0 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_1 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_bw.out.1", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_2 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_bw.out.2", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_3 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_bw.out.3", 'Size', 'Bandwidth(MB/s)')

osu_lat_df_0 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_latency.out", 'Size', 'Latency(us)')
osu_lat_df_1 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_latency.out.1", 'Size', 'Latency(us)')
osu_lat_df_2 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_latency.out.2", 'Size', 'Latency(us)')
osu_lat_df_3 = parse_benchmark_output(DIFF_CORES_SA_SO+"/osu_latency.out.3", 'Size', 'Latency(us)')

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df_0['Size'], y=osu_bw_df_0['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_1['Size'], y=osu_bw_df_1['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_2['Size'], y=osu_bw_df_2['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_3['Size'], y=osu_bw_df_3['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)

fig.add_trace(go.Scatter(x=osu_lat_df_0['Size'], y=osu_lat_df_0['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_1['Size'], y=osu_lat_df_1['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_2['Size'], y=osu_lat_df_2['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_3['Size'], y=osu_lat_df_3['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)

fig.update_layout(showlegend=False, title_text="Different cores of the same socket")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)

fig.show()
pio.write_image(fig, DIFF_CORES_SA_SO+"/multiple-lines.png")

In [116]:
osu_bw_df_0 = parse_benchmark_output(DIFF_NODES+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_1 = parse_benchmark_output(DIFF_NODES+"/osu_bw.out.1", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_2 = parse_benchmark_output(DIFF_NODES+"/osu_bw.out.2", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_3 = parse_benchmark_output(DIFF_NODES+"/osu_bw.out.3", 'Size', 'Bandwidth(MB/s)')

osu_lat_df_0 = parse_benchmark_output(DIFF_NODES+"/osu_latency.out", 'Size', 'Latency(us)')
osu_lat_df_1 = parse_benchmark_output(DIFF_NODES+"/osu_latency.out.1", 'Size', 'Latency(us)')
osu_lat_df_2 = parse_benchmark_output(DIFF_NODES+"/osu_latency.out.2", 'Size', 'Latency(us)')
osu_lat_df_3 = parse_benchmark_output(DIFF_NODES+"/osu_latency.out.3", 'Size', 'Latency(us)')

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df_0['Size'], y=osu_bw_df_0['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_1['Size'], y=osu_bw_df_1['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_2['Size'], y=osu_bw_df_2['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_3['Size'], y=osu_bw_df_3['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)

fig.add_trace(go.Scatter(x=osu_lat_df_0['Size'], y=osu_lat_df_0['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_1['Size'], y=osu_lat_df_1['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_2['Size'], y=osu_lat_df_2['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_3['Size'], y=osu_lat_df_3['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)

fig.update_layout(showlegend=False, title_text="Different nodes")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)

fig.show()
pio.write_image(fig, DIFF_NODES+"/multiple-lines.png")

In [117]:
osu_bw_df_0 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_bw.out", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_1 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_bw.out.1", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_2 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_bw.out.2", 'Size', 'Bandwidth(MB/s)')
osu_bw_df_3 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_bw.out.3", 'Size', 'Bandwidth(MB/s)')

osu_lat_df_0 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_latency.out", 'Size', 'Latency(us)')
osu_lat_df_1 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_latency.out.1", 'Size', 'Latency(us)')
osu_lat_df_2 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_latency.out.2", 'Size', 'Latency(us)')
osu_lat_df_3 = parse_benchmark_output(DIFF_SOCKET_SA_NO+"/osu_latency.out.3", 'Size', 'Latency(us)')

fig = make_subplots(rows=1, cols=2,
    subplot_titles=("OSU Bandwidth","OSU Latency"))
fig.add_trace(go.Scatter(x=osu_bw_df_0['Size'], y=osu_bw_df_0['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_1['Size'], y=osu_bw_df_1['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_2['Size'], y=osu_bw_df_2['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)
fig.add_trace(go.Scatter(x=osu_bw_df_3['Size'], y=osu_bw_df_3['Bandwidth(MB/s)'], line=dict(color='royalblue')), row=1, col=1)

fig.add_trace(go.Scatter(x=osu_lat_df_0['Size'], y=osu_lat_df_0['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_1['Size'], y=osu_lat_df_1['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_2['Size'], y=osu_lat_df_2['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)
fig.add_trace(go.Scatter(x=osu_lat_df_3['Size'], y=osu_lat_df_3['Latency(us)'], line=dict(color='firebrick')), row=1, col=2)

fig.update_layout(showlegend=False, title_text="Different sockets of the same node")
fig.update_xaxes(title_text='Size')
fig.update_yaxes(title_text='Bandwidth(MB/s)', row=1, col=1)
fig.update_yaxes(title_text='Latency(us)', row=1, col=2)

fig.show()
pio.write_image(fig, DIFF_SOCKET_SA_NO+"/multiple-lines.png")