# Scaling Analysis

Author: Brain Gravelle (gravelle@cs.uoregon.edu)


All of this analysis uses the taucmdr Python libraries from ParaTools:
http://taucommander.paratools.com/


## Imports
This section imports the necessary libraries along with the metrics.py and utilities.py files, and sets up the notebook window.


<a id='top'></a>

In [1]:
# A couple of scripts to set the environment and import data from a .tau set of results
from utilities import *
from metrics import *
# Plotting, notebook settings:
%matplotlib inline  
#plt.rcParams.update({'font.size': 16})
import numbers
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.float_format', lambda x: '%.2e' % x)
pd.set_option('display.max_columns',100)
pd.set_option('max_colwidth', 70)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import copy

## Getting Data

TAU Commander uses TAU to run the application and measure it using runtime sampling techniques (similar to Intel VTune). Many customization options are available. For example, we may consider each function regardless of calling context, or we may decide to enable callpath profiling to see each context separately.

From the talapas_scaling application the following experiments are available. These use Talapas (with 28-thread Broadwell processors) and the build-ce (realistic) option for mkFit. The first six experiments use the --num-thr option to set the thread count, which is intended to perform threading within each event. The last two add the --num-ev-thr option to set the event threads, so that all threads are used to process events in parallel and each event is processed by a single thread.
* manual_scaling_Large_talapas		
* manual_scaling_Large_talapas_fullnode	
* manual_scaling_TTbar70_talapas		
* manual_scaling_TTbar70_talapas_fullnode
* manual_scaling_TTbar35_talapas
* manual_scaling_TTbar35_talapas_fullnode
* ev_thr_scaling_Large_talapas
* ev_thr_scaling_Large_talapas_fullnode

The cori_scaling application additionally provides the following experiments. These were run on the KNL nodes of NERSC's Cori with the default memory settings (quad: 1 NUMA domain; cache: MCDRAM as a direct-mapped cache). See http://www.nersc.gov/users/computational-systems/cori/running-jobs/advanced-running-jobs-options/ for more information on the KNL modes. Like the Talapas scaling experiments, they use the build-ce option and threading within each event.
* manual_scaling_TTbar35
* manual_scaling_TTbar70
* manual_scaling_Large
* mixed_thr_scaling_Large - this is bad


### Importing Scaling Data - Cori TTbar70 is current
Here we import the data. In this case we are using Cori data from the experiments with the threads working within each event using the TTbar70 file. Note that this box will take an hour or more to run; please go enjoy a coffee while you wait.
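Because the import is so slow, it can help to cache the parsed result on disk and reload it in later sessions. The following is only a minimal sketch of that idea: `load_with_cache`, `slow_loader`, and the cache file name are hypothetical helpers that are not part of utilities.py, and it assumes the object returned by the loader can be pickled.

```python
import os
import pickle
import tempfile


def load_with_cache(cache_file, loader):
    """Return the cached result if cache_file exists, else call loader() and cache it."""
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    data = loader()
    with open(cache_file, "wb") as f:
        pickle.dump(data, f)
    return data


# Stand-in for an expensive import; it records how many times it is called.
calls = []


def slow_loader():
    calls.append(1)
    return {1: "data for 1 thread", 4: "data for 4 threads"}


cache_file = os.path.join(tempfile.mkdtemp(), "scaling_cache.pkl")
first = load_with_cache(cache_file, slow_loader)   # runs the loader
second = load_with_cache(cache_file, slow_loader)  # reads the pickle instead
print(len(calls))
```

In this notebook the loader argument could be something like `lambda: get_pandas_scaling(path, callpaths=True)`, with the cache file deleted whenever the underlying .tau data changes.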

In [2]:
# application = "talapas_scaling"
# experiment  = "manual_scaling_TTbar70_talapas"
# experiment  = "manual_scaling_Large_talapas"
# experiment = "ev_thr_scaling_Large_talapas"

application = "cori_scaling"
# experiment  = "manual_scaling_TTbar35"
experiment  = "manual_scaling_TTbar70"
# experiment  = "manual_scaling_Large"
# experiment  = "mixed_thr_scaling_Large"

path = ".tau/" + application + "/" + experiment + "/"
# note that this function takes a long time to run, so only rerun if you must

metric_data = get_pandas_scaling(path, callpaths=True)
    
if application == "talapas_scaling":
    metric_data = remove_erroneous_threads(metric_data,  [1, 8, 16, 32, 48, 56])
elif application == "cori_scaling":
    print(metric_data.keys())
    metric_data = remove_erroneous_threads(metric_data,  [1, 4, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256])

Parsing ERROR: 
dir = .tau/cori_scaling/manual_scaling_TTbar70/0_208_61//MULTI__PAPI_LST_INS
Found: 1101 trials with 10 errors


[256, 64, 1, 8, 128, 16, 18, 32, 112, 34, 176, 48, 192, 160, 96, 80, 82, 224, 144, 226, 208, 146, 272, 240]
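The key list printed above contains thread counts such as 18, 34, 82, 146, 226 and 272 that were not among the requested values, which is why the cell filters the data afterwards. The actual implementation of `remove_erroneous_threads` lives in utilities.py and is not shown here; the sketch below only illustrates the general idea with a hypothetical helper, `keep_expected_threads`, and placeholder values.

```python
# Hypothetical stand-in for the loaded data: a dict keyed by thread count.
# In the notebook the values would be per-metric tables, not strings.
raw = {1: "...", 8: "...", 16: "...", 18: "...", 32: "...", 34: "..."}
expected_threads = [1, 8, 16, 32]


def keep_expected_threads(data, expected):
    """Return a new dict holding only the entries whose key is an expected thread count."""
    wanted = set(expected)
    return {threads: trials for threads, trials in data.items() if threads in wanted}


filtered = keep_expected_threads(raw, expected_threads)
print(sorted(filtered))
```

Returning a new dictionary rather than deleting keys in place keeps the unfiltered data available for inspection.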


## A list of metrics

In [3]:
print_available_metrics(metric_data,True)

for key in metric_data[list(metric_data.keys())[5]]:  # list() keeps this working on Python 3
    if not key == 'METADATA':
        print(key)
print(metric_data.keys())

PAPI_BR_INS
PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD
PAPI_L2_TCA
PAPI_NATIVE_LLC_MISSES
PAPI_TLB_DM
PAPI_NATIVE_LLC_REFERENCES
PAPI_RES_STL
PAPI_L2_TCM
PAPI_TOT_INS
PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD
PAPI_NATIVE_FETCH_STALL
PAPI_LST_INS
PAPI_BR_UCN
PAPI_NATIVE_RS_FULL_STALL
PAPI_BR_CN
PAPI_L1_TCM
PAPI_BR_MSP
PAPI_TOT_CYC
PAPI_BR_INS
PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD
PAPI_L2_TCA
PAPI_NATIVE_LLC_MISSES
PAPI_TLB_DM
PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD
PAPI_RES_STL
PAPI_L2_TCM
PAPI_TOT_INS
PAPI_BR_UCN
PAPI_NATIVE_FETCH_STALL
PAPI_LST_INS
PAPI_BR_CN
PAPI_NATIVE_RS_FULL_STALL
PAPI_NATIVE_LLC_REFERENCES
PAPI_L1_TCM
PAPI_BR_MSP
PAPI_TOT_CYC
[256, 64, 240, 32, 144, 1, 8, 112, 128, 176, 192, 224, 96, 16, 80, 48, 160, 208]


## Adding metrics

Metrics are available in metrics.py. At this time the following can be added:
* add_IPC(metrics)          - Instructions per Cycle
* add_CPI(metrics)          - Cycles per instruction
* add_VIPC(metrics)         - vector instructions per cycle
* add_VIPI(metrics)         - vector instructions per instruction (i.e. fraction of total)
* add_L1_missrate(metrics)  - miss rate for L1 cache
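As a rough illustration of what these derived metrics mean, the sketch below computes IPC, CPI and an L1 miss rate from raw counter columns. The counter values and function names are invented for the example, and the L1 miss rate denominator (load/store instructions) is an assumption; the definitions in metrics.py may differ.

```python
import pandas as pd

# Invented per-function counter values; real values come from the TAU profiles.
counters = pd.DataFrame(
    {
        "PAPI_TOT_INS": [4.0e9, 1.0e9],
        "PAPI_TOT_CYC": [2.0e9, 4.0e9],
        "PAPI_LST_INS": [1.0e9, 5.0e8],
        "PAPI_L1_TCM": [1.0e7, 1.0e8],
    },
    index=["func_a", "func_b"],
)

derived = pd.DataFrame(index=counters.index)
# Instructions per cycle and its reciprocal, cycles per instruction.
derived["IPC"] = counters["PAPI_TOT_INS"] / counters["PAPI_TOT_CYC"]
derived["CPI"] = counters["PAPI_TOT_CYC"] / counters["PAPI_TOT_INS"]
# Assumed definition: L1 cache misses per load/store instruction.
derived["L1_MISSRATE"] = counters["PAPI_L1_TCM"] / counters["PAPI_LST_INS"]
print(derived)
```

Here `func_a` retires two instructions per cycle while `func_b` needs four cycles per instruction, the kind of contrast the derived metrics are meant to expose.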

For scaling data, please use the add_metric_to_scaling_data(data, metric_func) function to add a metric.

Here we add some predefined metrics; the commented-out line at the end of the cell can be used to print the 10 functions with the highest IPC.

In [4]:
add_metric_to_scaling_data(metric_data, add_CPI)
add_metric_to_scaling_data(metric_data, add_IPC)
add_metric_to_scaling_data(metric_data, add_L1_missrate)
add_metric_to_scaling_data(metric_data, add_L2_missrate)
add_metric_to_scaling_data(metric_data, add_VIPI)
llc = (application == 'cori_scaling')
add_metric_to_scaling_data(metric_data, add_L3_missrate, llc)
print_available_metrics(metric_data, scaling=True)

add_metric_to_scaling_data(metric_data, add_DERIVED_BRANCH_MR)
add_metric_to_scaling_data(metric_data, add_DERIVED_RATIO_FETCH_STL_TOT_CYC)

# To test
# metric_data[1]['DERIVED_IPC'].sort_values(by='Inclusive',ascending=False).head(10)

DERIVED_VIPI
PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD
PAPI_L2_TCA
PAPI_NATIVE_LLC_MISSES
PAPI_TLB_DM
PAPI_L2_TCM
PAPI_NATIVE_FETCH_STALL
PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD
DERIVED_CPI
PAPI_L1_TCM
PAPI_BR_MSP
PAPI_RES_STL
PAPI_TOT_INS
PAPI_BR_CN
DERIVED_L1_MISSRATE
DERIVED_L3_MISSRATE
PAPI_BR_UCN
PAPI_NATIVE_LLC_REFERENCES
PAPI_BR_INS
DERIVED_L2_MISSRATE
DERIVED_IPC
PAPI_LST_INS
PAPI_NATIVE_RS_FULL_STALL
PAPI_TOT_CYC


#### Combining metrics

In [7]:
THREAD_COUNT = 32

alldata = combine_metrics(metric_data[THREAD_COUNT],inc_exc='Exclusive')

## Scaling Results

In this section we carefully walk through an analysis of the application to find areas of interest.

We begin by looking at correlations in the data to determine metrics of interest, and then move on to plotting those metrics. In this analysis we primarily use PAPI_TOT_CYC as a proxy for the time it takes a function to complete.

## Correlations
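The cells in this section compute correlations with three methods: Pearson, Kendall and Spearman. Pearson measures linear association, while Kendall and Spearman are rank-based and respond to any monotonic relationship. The toy example below, with invented values in two columns named after PAPI counters, shows how Pearson and Spearman can disagree on the same data. Kendall is left out only because `DataFrame.corr(method='kendall')` relies on SciPy being installed.

```python
import pandas as pd

# Invented data: cycles grow as the cube of the miss count, so the
# relationship is perfectly monotonic but not linear.
toy = pd.DataFrame(
    {
        "PAPI_L1_TCM": [1.0, 2.0, 3.0, 4.0, 5.0],
        "PAPI_TOT_CYC": [1.0, 8.0, 27.0, 64.0, 125.0],
    }
)

pearson = toy.corr(method="pearson").loc["PAPI_L1_TCM", "PAPI_TOT_CYC"]
spearman = toy.corr(method="spearman").loc["PAPI_L1_TCM", "PAPI_TOT_CYC"]

print("pearson:  {:.2%}".format(pearson))
print("spearman: {:.2%}".format(spearman))
```

Spearman reports a perfect correlation because the rank orderings match exactly, whereas Pearson comes out lower (roughly 94% here) because the points do not lie on a straight line. This is one reason the same metric can rank differently across the three tables.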

In [22]:
cm = sns.light_palette("yellow", as_cmap=True)

def print_corr(alldata, method='pearson', metrics=['PAPI_TOT_CYC', 'PAPI_TOT_INS']):
    correlations = alldata.corr(method).fillna(0)[metrics]    # Other methods: 'kendall', 'spearman'
    return correlations.style.format("{:.2%}").background_gradient(cmap=cm)
    
print_corr(alldata)

| Metric | PAPI_TOT_CYC | PAPI_TOT_INS |
|---|---|---|
| PAPI_TOT_CYC | 100.00% | 60.17% |
| DERIVED_VIPI | 0.42% | -11.67% |
| PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD | 20.97% | 74.32% |
| PAPI_L2_TCA | 75.23% | 92.60% |
| PAPI_NATIVE_LLC_MISSES | 78.18% | 85.12% |
| PAPI_TLB_DM | 54.52% | 83.70% |
| PAPI_L2_TCM | 75.58% | 85.70% |
| PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD | 78.75% | 74.12% |
| PAPI_NATIVE_FETCH_STALL | 86.78% | 72.38% |
| PAPI_L1_TCM | 74.25% | 91.75% |


In [18]:
print_corr(alldata, method='kendall')

| Metric | PAPI_TOT_CYC | PAPI_TOT_INS |
|---|---|---|
| PAPI_TOT_CYC | 100.00% | 50.20% |
| DERIVED_VIPI | 18.42% | 3.32% |
| PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD | 41.58% | 67.86% |
| PAPI_L2_TCA | 49.19% | 73.51% |
| PAPI_NATIVE_LLC_MISSES | 46.69% | 73.14% |
| PAPI_TLB_DM | 43.82% | 66.73% |
| PAPI_L2_TCM | 51.83% | 78.03% |
| PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD | 43.89% | 63.00% |
| PAPI_NATIVE_FETCH_STALL | 53.47% | 69.59% |
| PAPI_L1_TCM | 51.80% | 78.28% |


In [19]:
print_corr(alldata, method='spearman')

| Metric | PAPI_TOT_CYC | PAPI_TOT_INS |
|---|---|---|
| PAPI_TOT_CYC | 100.00% | 62.88% |
| DERIVED_VIPI | 28.06% | 5.99% |
| PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD | 52.84% | 83.14% |
| PAPI_L2_TCA | 62.47% | 88.83% |
| PAPI_NATIVE_LLC_MISSES | 59.92% | 88.35% |
| PAPI_TLB_DM | 56.99% | 83.30% |
| PAPI_L2_TCM | 64.51% | 91.77% |
| PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD | 58.31% | 79.25% |
| PAPI_NATIVE_FETCH_STALL | 68.44% | 85.91% |
| PAPI_L1_TCM | 64.84% | 92.15% |
