# APMA2822B Homework 1 - Hammad Izhar + Robert Scheidegger

#TODO: Intro text (HAMMAD)

In [38]:
# Initial block: Load the CSV file from the data folder (oscar_data.csv)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('../data/oscar_data.csv', encoding='latin-1')
df['flops'] = 2 * df['n'] * df['n'] / df['time_us'] * 1e6
df['gflops'] = df['flops'] / 1e9
df['iops'] = 5 * df['n'] * df['m'] / df['time_us'] * 1e6



## Roof-line Analysis

For the purposes of this analyis, we will look at the results from the `DisjointMemoryAllocator`, `RowColumnMultiplier`, and size `100000 x 10000` (the largest of the sizes that we have tested for this assignment).

We see that the code for the `RowColumnMultiplier` is as follows (exluding timing and parallelization primitives; the exact code can be found in `include/multipliers.hpp`):

```
for (uint32_t i = 0; i < n; i++) {
    for (uint32_t j = 0; j < m; j++) {
        output[i] += matrix[i][j] * vector[j];
    }
}
```

To compute the arithmetic intensity, we first consider all of the IOps:

1. Load `matrix[i]` (since this is a pointer, and is needed to find the column value)
2. Load `matrix[i][j]`
3. Load `vector[j]`
4. Save `matrix[i][j] * vector[j]` into a temporary variable
5. Load `output[i]` (can add this to the temporary variable)
6. Save `output[i]`

This gives 6 input/output operations, which each act on 4 bytes (since these are all `float` values), giving a total of `24` bytes transferred. Further, the floating point operations are:

1. Multiply `RESULT = matrix[i][j] * vector[j]`
2. Add `RESULT + output[i]`

This gives an overall arithmetic intensity (AI) of `AI = 2 / 24 = 1 / 12`. By the example numbers given we should expect the limiting point of the roof-line plot to be at `10^12 / (160 * 10^9) = 6.25`, so clearly we are within the memory bandwidth limiting section of the roof-line plot. Thus, we should expect a peak GFLOPs rate of `1000 * (1/12) = 83.3` GFLOPs. From the table below, you can see that out actual peak occurred at 110 GFLOPs when using 32 threads to compute the result, which is certainly in line with the expectation (especially since the TFLOPs estimate is non sync'd to the OSCAR hardware we are actually running on).

This can be seen in the table and actual roof-line plot below:

In [41]:
# Compute the roof-line plot for the analysis above.
subset = df[df['allocator'] == 'ContiguousMemoryAllocator']
subset = subset[subset['multiplier'] == 'RowColumnMultiplier']
subset = subset[subset['m'] == 10000]
subset = subset[subset['n'] == 100000]

# Plot the roof-line plot
# TODO: Hammad

subset



Unnamed: 0,n,m,threads,allocator,multiplier,iterations,time_us,stdev_us,flops,gflops,iops
1908,100000,10000,1,ContiguousMemoryAllocator,RowColumnMultiplier,10,2734335.0,,7314393000.0,7.314393,1828598000.0
1916,100000,10000,2,ContiguousMemoryAllocator,RowColumnMultiplier,10,1361415.0,512.0,14690600000.0,14.6906,3672650000.0
1924,100000,10000,4,ContiguousMemoryAllocator,RowColumnMultiplier,10,682673.5,0.0,29296580000.0,29.296582,7324145000.0
1932,100000,10000,8,ContiguousMemoryAllocator,RowColumnMultiplier,10,342263.9,90.509666,58434440000.0,58.434441,14608610000.0
1940,100000,10000,16,ContiguousMemoryAllocator,RowColumnMultiplier,10,170800.0,169.328079,117096000000.0,117.096019,29274000000.0
1948,100000,10000,32,ContiguousMemoryAllocator,RowColumnMultiplier,10,178243.2,9558.630859,112206200000.0,112.206242,28051560000.0
1956,100000,10000,64,ContiguousMemoryAllocator,RowColumnMultiplier,10,201112.4,5523.685547,99446870000.0,99.446873,24861720000.0
1964,100000,10000,1,ContiguousMemoryAllocator,RowColumnMultiplier,10,2730175.0,,7325537000.0,7.325537,1831384000.0
1972,100000,10000,2,ContiguousMemoryAllocator,RowColumnMultiplier,10,1365162.0,362.038666,14650270000.0,14.650271,3662568000.0
1980,100000,10000,4,ContiguousMemoryAllocator,RowColumnMultiplier,10,682901.7,256.0,29286790000.0,29.286792,7321698000.0


## Performance Analysis

To test our

This leads to a total of 2016 possible combinations of parameters, and for each of these we performed an experiment on oscar (in a single script, which iterated through all of the possible benchmark configurations). A warmup/dummy computation was added to each  was added 

To ensure consistency within a run, each was repeated for `10` iterations, and the mean and standard deviations of the runtimes were computed. A sample of the data that we collected is seen below:

In [31]:
df
df[df['time_us'] != 0].sort_values('flops', ascending=False)

Unnamed: 0,n,m,threads,allocator,multiplier,iterations,time_us,stdev_us,flops,gflops,iops
1708,100000,1,8,ContiguousMemoryAllocator,RowColumnMultiplier,10,50.099998,2.022457,3.992016e+14,3.992016e+05,9.980040e+09
1706,100000,1,8,DisjointRowMemoryAllocator,RowColumnMultiplier,10,85.500000,10.452272,2.339181e+14,2.339181e+05,5.847953e+09
1700,100000,1,4,ContiguousMemoryAllocator,RowColumnMultiplier,10,93.800003,2.357874,2.132196e+14,2.132196e+05,5.330490e+09
1704,100000,1,8,DisjointMemoryAllocator,RowColumnMultiplier,10,110.000000,35.437267,1.818182e+14,1.818182e+05,4.545455e+09
1724,100000,1,32,ContiguousMemoryAllocator,RowColumnMultiplier,10,126.199997,26.049194,1.584786e+14,1.584786e+05,3.961965e+09
...,...,...,...,...,...,...,...,...,...,...,...
91,1,10,16,DisjointRowMemoryAllocator,ColumnRowMultiplier,10,8447.700195,1202.128174,2.367508e+02,2.367508e-07,5.918771e+03
313,1,10000,16,DisjointMemoryAllocator,ColumnRowMultiplier,10,8459.099609,1211.121826,2.364318e+02,2.364318e-07,5.910795e+06
262,1,10000,16,MmapMemoryAllocator,RowColumnMultiplier,10,8461.599609,1215.756592,2.363619e+02,2.363619e-07,5.909048e+06
256,1,10000,16,DisjointMemoryAllocator,RowColumnMultiplier,10,8489.299805,1232.574585,2.355907e+02,2.355907e-07,5.889767e+06
