# Scaling Analysis

Author: Brain Gravelle (gravelle@cs.uoregon.edu)


All of this analysis uses the taucmdr Python libraries from ParaTools:
http://taucommander.paratools.com/


## Imports
This section imports the necessary libraries and the metrics.py and utilities.py files, and sets up the notebook display.


<a id='top'></a>

In [1]:
# A couple of scripts to set the environment and import data from a .tau set of results
from utilities import *
from metrics import *
# Plotting, notebook settings:
%matplotlib inline  
#plt.rcParams.update({'font.size': 16})
import numbers
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
pd.set_option('display.float_format', lambda x: '%.2e' % x)
pd.set_option('display.max_columns',100)
pd.set_option('max_colwidth', 70)

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"



<a href='#tot_cyc'>tot cyc</a><br>
<a href='#l1_mr'>l1_mr</a><br>
<a href='#l2_mr'>l2_mr</a><br>
<a href='#l3_mr'>l3_mr</a><br>
<a href='#branch_mr'>branch_mr</a><br>
<a href='#fetch_stall'>fetch_stall</a><br>
<a href='#vipi'>vipi</a><br>

## Getting Data

TAU Commander uses TAU to run the application and measure it using runtime sampling techniques (similar to Intel VTune). Many customization options are available. For example, we may consider each function regardless of calling context, or we may decide to enable callpath profiling to see each context separately.

From the talapas_scaling application the following experiments are available. These were run on Talapas (28-thread Broadwell processors) with the build-ce (realistic) option for mkFit. The first six experiments use the --num-thr option to set the thread count, which is intended to thread the work within each event. The last two add the --num-ev-thr option to set the number of event threads, so that all threads process events in parallel and each event is processed by a single thread.
* manual_scaling_Large_talapas		
* manual_scaling_Large_talapas_fullnode	
* manual_scaling_TTbar70_talapas		
* manual_scaling_TTbar70_talapas_fullnode
* manual_scaling_TTbar35_talapas
* manual_scaling_TTbar35_talapas_fullnode
* ev_thr_scaling_Large_talapas
* ev_thr_scaling_Large_talapas_fullnode

The cori_scaling application additionally provides the following experiments. These were run on the KNL nodes of NERSC's Cori with the default memory settings (quad: one NUMA domain; cache: MCDRAM as a direct-mapped cache). See http://www.nersc.gov/users/computational-systems/cori/running-jobs/advanced-running-jobs-options/ for more information on the KNL modes. As with the Talapas runs, they use the build-ce option and threading within each event.
* manual_scaling_TTbar35
* manual_scaling_TTbar70
* manual_scaling_Large
* mixed_thr_scaling_Large - this is bad


### Importing Scaling Data - Cori TTbar70 is current
Here we import the data. In this case we are using the Cori data from the TTbar70 experiments, in which the threads work within each event. Note that this cell can take an hour or more to run; please go enjoy a coffee while you wait.

In [None]:
# application = "talapas_scaling"
# experiment  = "manual_scaling_TTbar70_talapas"
# experiment  = "manual_scaling_Large_talapas"
# experiment = "ev_thr_scaling_Large_talapas"

application = "cori_scaling"
# experiment  = "manual_scaling_TTbar35"
experiment  = "manual_scaling_TTbar70"
# experiment  = "manual_scaling_Large"
# experiment  = "mixed_thr_scaling_Large"

path = ".tau/" + application + "/" + experiment + "/"
# note that this function takes a long time to run, so only rerun if you must

metric_data = get_pandas_scaling(path, callpaths=True)
    
if application == "talapas_scaling":
    metric_data = remove_erroneous_threads(metric_data,  [1, 8, 16, 32, 48, 56])
elif application == "cori_scaling":
    print(metric_data.keys())
    metric_data = remove_erroneous_threads(metric_data,  [1, 4, 8, 16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256])

In [None]:
# metric_data[8]['PAPI_TOT_CYC'].head(10)

## A list of metrics

In [None]:
print_available_metrics(metric_data,True)

for key in metric_data[metric_data.keys()[5]]:
    if not key == 'METADATA':
        print(key)
print(metric_data.keys())

#### Metric metadata

In [None]:
# print_metadata(metric_data[8])

## Adding metrics

Derived metrics are defined in metrics.py. At this time the following can be added:
* add_IPC(metrics)          - Instructions per Cycle
* add_CPI(metrics)          - Cycles per instruction
* add_VIPC(metrics)         - vector instructions per cycle
* add_VIPI(metrics)         - vector instructions per instruction (i.e. fraction of total)
* add_L1_missrate(metrics)  - miss rate for L1 cache

For scaling data, use the add_metric_to_scaling_data(data, metric_func) function to add a metric to every thread count.
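
As a rough illustration of what a helper like add_metric_to_scaling_data does, the sketch below applies a metric function to the metric dictionary of every thread count. It is not the implementation from metrics.py: apply_metric_to_scaling_data, toy_add_ipc and the plain-number counters are hypothetical stand-ins (the real dictionaries hold pandas DataFrames).

```python
def toy_add_ipc(metrics):
    """Add instructions per cycle; return False if a counter is missing."""
    if 'PAPI_TOT_INS' not in metrics or 'PAPI_TOT_CYC' not in metrics:
        return False
    metrics['DERIVED_IPC'] = float(metrics['PAPI_TOT_INS']) / metrics['PAPI_TOT_CYC']
    return True


def apply_metric_to_scaling_data(data, metric_func, *args):
    """Call metric_func on the metric dictionary of each thread count.

    Returns {thread_count: True/False} so missing counters are easy to spot.
    """
    return {n_thr: metric_func(metrics, *args) for n_thr, metrics in data.items()}


# Synthetic counters keyed by thread count (all values are made up).
toy_data = {
    1: {'PAPI_TOT_INS': 8.0e9, 'PAPI_TOT_CYC': 4.0e9},
    8: {'PAPI_TOT_INS': 8.8e9, 'PAPI_TOT_CYC': 5.5e9},
    16: {'PAPI_TOT_CYC': 0.9e9},  # instruction counter missing on purpose
}

status = apply_metric_to_scaling_data(toy_data, toy_add_ipc)
```

Returning a status per thread count, rather than stopping at the first failure, makes it easy to see which runs lack a counter.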

Here we add some predefined metrics. The commented-out line at the end of the cell prints the 10 functions with the highest IPC.

In [None]:
add_metric_to_scaling_data(metric_data, add_CPI)
add_metric_to_scaling_data(metric_data, add_IPC)
add_metric_to_scaling_data(metric_data, add_L1_missrate)
add_metric_to_scaling_data(metric_data, add_L2_missrate)
add_metric_to_scaling_data(metric_data, add_VIPI)
# use the native last-level-cache (LLC) counters on Cori
llc = (application == 'cori_scaling')
add_metric_to_scaling_data(metric_data, add_L3_missrate, llc)
print_available_metrics(metric_data, scaling=True)

# metric_data[1]['DERIVED_IPC'].sort_values(by='Inclusive',ascending=False).head(10)

In [None]:
def add_DERIVED_BRANCH_MR(metrics):
    if (not metrics.has_key('PAPI_BR_MSP')):
        print 'ERROR adding DERIVED_BRANCH_MR to metric dictionary'
        return False
    a0 = metrics['PAPI_BR_MSP'].copy()
    a0.index = a0.index.droplevel()
    u0 = a0.unstack()
    if (not metrics.has_key('PAPI_BR_CN')):
        print 'ERROR adding DERIVED_BRANCH_MR to metric dictionary'
        return False
    a1 = metrics['PAPI_BR_CN'].copy()
    a1.index = a1.index.droplevel()
    u1 = a1.unstack()
    metrics['DERIVED_BRANCH_MR'] = a0 / a1

    return True

def add_DERIVED_RATIO_FETCH_STL_TOT_CYC(metrics):
    # fetch stalls per total cycle
    if 'PAPI_NATIVE_FETCH_STALL' not in metrics:
        print 'ERROR adding DERIVED_RATIO_FETCH_STL_TOT_CYC to metric dictionary'
        return False
    a0 = metrics['PAPI_NATIVE_FETCH_STALL'].copy()
    a0.index = a0.index.droplevel()
    if 'PAPI_TOT_CYC' not in metrics:
        print 'ERROR adding DERIVED_RATIO_FETCH_STL_TOT_CYC to metric dictionary'
        return False
    a1 = metrics['PAPI_TOT_CYC'].copy()
    a1.index = a1.index.droplevel()
    metrics['DERIVED_RATIO_FETCH_STL_TOT_CYC'] = a0 / a1

    return True



add_metric_to_scaling_data(metric_data, add_DERIVED_BRANCH_MR)
add_metric_to_scaling_data(metric_data, add_DERIVED_RATIO_FETCH_STL_TOT_CYC)


# metric_data[1]['DERIVED_BRANCH_MR'].sort_values(by='Inclusive',ascending=False).head(10)
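
The two derived metrics above follow the same pattern: copy two counters, drop the outermost index level, and divide. Below is a minimal sketch of that pattern on synthetic data, assuming each counter is a pandas DataFrame with Inclusive and Exclusive columns and a three-level index. The level names, ratio_metric and all values are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one row per (rank, thread, region).
index = pd.MultiIndex.from_tuples(
    [(0, 0, "main"), (0, 0, "kernel"), (0, 1, "main"), (0, 1, "kernel")],
    names=["rank", "thread", "region"],
)
mispredicted = pd.DataFrame(
    {"Inclusive": [10.0, 4.0, 12.0, 6.0], "Exclusive": [6.0, 4.0, 6.0, 6.0]},
    index=index,
)
branches = pd.DataFrame(
    {"Inclusive": [200.0, 80.0, 240.0, 100.0], "Exclusive": [120.0, 80.0, 140.0, 100.0]},
    index=index,
)


def ratio_metric(numerator, denominator):
    """Return numerator / denominator after dropping the outermost index level."""
    num = numerator.copy()  # copy so the caller's frames keep their full index
    den = denominator.copy()
    num.index = num.index.droplevel()
    den.index = den.index.droplevel()
    return num / den


branch_mr = ratio_metric(mispredicted, branches)
```

Dividing DataFrames aligns rows by index label, so both counters must describe the same (thread, region) pairs; rows missing from either side come out as NaN.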

## Scaling Results

In this section we demo some scaling results with several different metrics.

We use the scaling_plot function to plot the data against the thread count:
scaling_plot(data, inclusive=True, plot=True, function="\[SUMMARY\] .TAU application$", metric='PAPI_TOT_CYC', max=False)
* data = the full dictionary of scaling data 
* inclusive = determines if the inclusive data or exclusive data will be used
* plot = True makes a plot; False only returns the data
* function = the string that will be searched for to plot. Default looks at the whole application
* metric = the metric chosen from the above list
* max = use the max value or average value across the threads
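
To make the function and max parameters concrete, the sketch below selects regions with a regular expression and reduces the per-thread values with either the maximum or the mean. It is only a guess at the behaviour described above, not the code behind scaling_plot; reduce_region, the region names and the numbers are hypothetical.

```python
import re


def reduce_region(per_thread, pattern, use_max=False):
    """Reduce per-thread values of the regions whose name matches pattern.

    per_thread maps a region name to a list with one value per thread.
    """
    selected = [vals for name, vals in per_thread.items() if re.search(pattern, name)]
    if not selected:
        raise KeyError("no region matches {!r}".format(pattern))
    # Sum the matching regions thread by thread, then reduce across threads.
    totals = [sum(vals) for vals in zip(*selected)]
    return max(totals) if use_max else sum(totals) / float(len(totals))


# Synthetic inclusive cycle counts for two threads.
toy_regions = {
    "[SUMMARY] .TAU application": [100.0, 140.0],
    "[SUMMARY] .TAU application => [SUMMARY] kernel": [60.0, 90.0],
}

# The brackets are escaped and `$` anchors the end of the name, so only the
# top-level entry matches and the call-path entry below it is excluded.
pattern = r"\[SUMMARY\] .TAU application$"
slowest = reduce_region(toy_regions, pattern, use_max=True)
average = reduce_region(toy_regions, pattern, use_max=False)
```

Here slowest is the value of the busiest thread and average is the mean over both threads.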

## Scaling with total cycles vs the thread count
Here we plot the cycle count for each thread count as a proxy for execution time. We use the max cycle count rather than the average as this number will limit the time of execution.
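
If cycle counts are taken as a proxy for time (which assumes the clock frequency is the same in every run), speedup and parallel efficiency follow directly from the maximum cycle counts. A small sketch with synthetic numbers; speedup_and_efficiency and the toy lists are hypothetical, with the lists shaped like the thread counts and cycle counts that scaling_plot returns:

```python
def speedup_and_efficiency(thread_counts, max_cycles):
    """Return (speedup, efficiency) lists relative to the first entry."""
    base_threads = thread_counts[0]
    base_cycles = float(max_cycles[0])
    speedup = [base_cycles / c for c in max_cycles]
    # Efficiency is the speedup divided by the increase in thread count.
    efficiency = [s * base_threads / n for s, n in zip(speedup, thread_counts)]
    return speedup, efficiency


# Synthetic maximum cycle counts (all values are made up).
toy_threads = [1, 2, 4, 8]
toy_max_cycles = [8.0e9, 4.0e9, 2.5e9, 2.0e9]
speedup, efficiency = speedup_and_efficiency(toy_threads, toy_max_cycles)
```

An efficiency close to 1.0 means the added threads are fully used; in the toy data it falls to 0.5 at 8 threads.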

<a id='tot_cyc'></a>

<a href='#top'>top</a><br>
<a href='#tot_cyc'>tot cyc</a><br>
<a href='#l1_mr'>l1_mr</a><br>
<a href='#l2_mr'>l2_mr</a><br>
<a href='#l3_mr'>l3_mr</a><br>
<a href='#branch_mr'>branch_mr</a><br>
<a href='#fetch_stall'>fetch_stall</a><br>
<a href='#vipi'>vipi</a><br>

In [None]:
thread_list, tot_cyc_list = scaling_plot(metric_data, function="\[SUMMARY\] .TAU application$", max=True)

In [None]:
THREAD_COUNT = 1
# func = 'clean_cms_seedtracks'
func = 'NULL'
cyc_data = select_metric_from_scaling(metric_data, 'PAPI_TOT_CYC')
# get_func_level_metric(cyc_data[THREAD_COUNT], avg=True, func="LayerOfHits::").head(20)
get_func_level_metric(cyc_data[THREAD_COUNT], avg=True, func=func).head(20)

In [None]:
THREAD_COUNT = 8
func = 'NULL'
# func = 'helixAtRFromIterativeCCS'
# func = 'MultHelixPropTransp'
# func = '.TAU application$'
cyc_data = select_metric_from_scaling(metric_data, 'PAPI_TOT_CYC')
get_func_level_metric(cyc_data[THREAD_COUNT], avg=True, func=func).head(20)


# bottom of top ten = 2.1e8
# total is 4.4e9

In [None]:
THREAD_COUNT = 48
# func = 'FindCandidatesCloneEngine'
func = 'NULL'
cyc_data = select_metric_from_scaling(metric_data, 'PAPI_TOT_CYC')
get_func_level_metric(cyc_data[THREAD_COUNT], avg=True, func = func).head(30)


### Cycles per thread for each thread count
Here we show load balancing with a series of plots showing the cycle count per thread. There is one plot for each thread count used.

In [None]:
thread_cyc_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_TOT_CYC')

for kt in thread_list:
    print kt
    data = list(thread_cyc_data[kt])
    matplotlib.pyplot.bar(range(len(data)), data)
    matplotlib.pyplot.ylim(ymax=50000000000) 
    matplotlib.pyplot.show()
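
The bar plots show the balance visually; a single number per thread count can make runs easier to compare. One common choice is the ratio of the busiest thread to the mean. The sketch below uses synthetic values, and load_imbalance is a hypothetical helper, not part of utilities.py.

```python
def load_imbalance(per_thread_cycles):
    """Return max / mean of the per-thread values (1.0 means perfectly balanced)."""
    values = [float(v) for v in per_thread_cycles]
    mean = sum(values) / len(values)
    return max(values) / mean


# Synthetic per-thread cycle counts (all values are made up).
balanced = load_imbalance([4.0e9, 4.0e9, 4.0e9, 4.0e9])
skewed = load_imbalance([6.0e9, 2.0e9, 2.0e9, 2.0e9])
```

A value of 2.0, as in the skewed case, means the slowest thread does twice the average work.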

## L1 Missrate vs thread count
Similar to above, these cells show the L1 miss rates. In this case we want to get the plotting data for L1 accesses and misses but compute the miss rate before plotting, so we set plot=False.

<a href='#top'>top</a><br>

<a id='l1_mr'></a>

In [None]:
thread_list, L1A_data = scaling_plot(metric_data, plot=False, metric='PAPI_LST_INS')
thread_list, L1M_data = scaling_plot(metric_data, plot=False, metric='PAPI_L1_TCM')
    
L1_MR_list = [L1M_data[i] / L1A_data[i] for i in range(len(thread_list))]

# do not assign to `plt`, which would shadow the usual pyplot alias
line_handles = matplotlib.pyplot.plot(thread_list, L1_MR_list)
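
The element-wise division in this cell assumes floating-point counters and non-zero access counts. A slightly more defensive version of the same miss-rate calculation is sketched below; miss_rates and the toy counters are hypothetical.

```python
def miss_rates(misses, accesses):
    """Element-wise misses / accesses; None where there were no accesses."""
    if len(misses) != len(accesses):
        raise ValueError("misses and accesses must have the same length")
    rates = []
    for m, a in zip(misses, accesses):
        # float() avoids integer (floor) division under Python 2.
        rates.append(float(m) / a if a else None)
    return rates


# Synthetic counters, one entry per thread count (all values are made up).
toy_l1_accesses = [1000, 2000, 0, 4000]
toy_l1_misses = [20, 50, 0, 60]
toy_rates = miss_rates(toy_l1_misses, toy_l1_accesses)
```

Entries with no recorded accesses come back as None instead of raising ZeroDivisionError, so they can be filtered out before plotting.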

### L1 Miss rate by each thread of each thread count

In [None]:
thread_L1A_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_LST_INS')
thread_L1M_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_L1_TCM')

MR_data = {}
for kt in thread_list:
#     print(thread_L1M_data[kt])
#     print(thread_L1A_data[kt])
    MR_data[kt] = thread_L1M_data[kt] / thread_L1A_data[kt]
    
for kt in thread_list:
    print kt
    data = list(MR_data[kt])
    matplotlib.pyplot.bar(range(len(data)), data)
    matplotlib.pyplot.ylim(ymax=0.05)
    matplotlib.pyplot.show()

### L1 Top 10 bad miss rates

In [None]:
L1_data = select_metric_from_scaling(metric_data, 'DERIVED_L1_MISSRATE')
L1_MR_dict = {}
for n_thr in thread_list:
    L1_MR_dict[n_thr] = filter_libs_out(L1_data[n_thr]).sort_values(by='Exclusive',ascending=False)[["Exclusive"]]
print thread_list

In [None]:
THREAD_COUNT = 1
# func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L1_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8

get_func_level_metric(L1_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(L1_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

### L1 Top 10 bad miss counts

In [None]:
L1_tcm = select_metric_from_scaling(metric_data, 'PAPI_L1_TCM')
L1_tcm_dict = {}
for n_thr in thread_list:
    L1_tcm_dict[n_thr] = filter_libs_out(L1_tcm[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 1
# func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L1_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L1_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(L1_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

## L2 Missrate vs thread count
Similar to above these cells show the L2 missrates.

<a href='#top'>top</a><br>
<a id='l2_mr'></a>

In [None]:
thread_list, L2A_data = scaling_plot(metric_data, plot=False, metric='PAPI_L2_TCA')
thread_list, L2M_data = scaling_plot(metric_data, plot=False, metric='PAPI_L2_TCM')
    
L2_MR_list = [L2M_data[i] / L2A_data[i] for i in range(len(thread_list))]

# do not assign to `plt`, which would shadow the usual pyplot alias
line_handles = matplotlib.pyplot.plot(thread_list, L2_MR_list)

In [None]:
# same call form as the L1 and L3 cells
thread_L2A_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_L2_TCA')
thread_L2M_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_L2_TCM')


L2_MR_data = {}
for kt in thread_list:
    L2_MR_data[kt] = thread_L2M_data[kt] / thread_L2A_data[kt]
    
for kt in thread_list:
    print kt
    data = list(L2_MR_data[kt])
    matplotlib.pyplot.bar(range(len(data)), data)
    matplotlib.pyplot.ylim(ymax=0.3)
    matplotlib.pyplot.show()

### L2 Top 10 bad miss rates

In [None]:
L2_data = select_metric_from_scaling(metric_data, 'DERIVED_L2_MISSRATE')
L2_MR_dict = {}
for n_thr in thread_list:
    L2_MR_dict[n_thr] = filter_libs_out(L2_data[n_thr]).sort_values(by='Exclusive',ascending=False)[["Exclusive"]]
print thread_list

In [None]:
THREAD_COUNT = 8
# func = 'NULL'
func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L1_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L2_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
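THREAD_COUNT = 8
# NOTE: L3_MR_dict is built in the "L3 Top 10 bad miss rates" cell further down; run that cell first
get_func_level_metric(L3_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)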

### ALL CACHE Top 10 bad miss counts

In [None]:
L2_tcm = select_metric_from_scaling(metric_data, 'PAPI_L2_TCM')
L2_tcm_dict = {}
for n_thr in thread_list:
    L2_tcm_dict[n_thr] = filter_libs_out(L2_tcm[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 8
# func = 'NULL'
func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L1_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L2_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
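THREAD_COUNT = 8
# NOTE: L3_tcm_dict is built in the "L3 Top 10 bad miss counts" cell further down; run that cell first
get_func_level_metric(L3_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)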

#### Total Cache misses

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L1_tcm_dict[THREAD_COUNT], func='.TAU application$', inclusive='inclusive', avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L2_tcm_dict[THREAD_COUNT], func='.TAU application$', inclusive='inclusive', avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L3_tcm_dict[THREAD_COUNT], func='.TAU application$', inclusive='inclusive', avg=True).head(20)

## L3 Missrate vs thread count
Similar to above these cells show the L3 missrates.


<a href='#top'>top</a><br>
<a id='l3_mr'></a>

In [None]:
if application == 'talapas_scaling':
    thread_list, LLA_data = scaling_plot(metric_data, plot=False, metric='PAPI_L3_TCA')
    thread_list, LLM_data = scaling_plot(metric_data, plot=False, metric='PAPI_L3_TCM')
else:
    thread_list, LLA_data = scaling_plot(metric_data, plot=False, metric='PAPI_NATIVE_LLC_REFERENCES')
    thread_list, LLM_data = scaling_plot(metric_data, plot=False, metric='PAPI_NATIVE_LLC_MISSES')
    
LL_MR_list = [LLM_data[i] / LLA_data[i] for i in range(len(thread_list))]

# do not assign to `plt`, which would shadow the usual pyplot alias
line_handles = matplotlib.pyplot.plot(thread_list, LL_MR_list)

In [None]:
if application == 'talapas_scaling':
    thread_LLA_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_L3_TCA')
    thread_LLM_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_L3_TCM')
else:
    thread_LLA_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_NATIVE_LLC_REFERENCES')
    thread_LLM_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_NATIVE_LLC_MISSES')


LL_MR_data = {}
for kt in thread_list:
    LL_MR_data[kt] = thread_LLM_data[kt] / thread_LLA_data[kt]

def thread_bar_plots(data_dict, t_list, y=-1):
    for kt in t_list:
        print "Thread Count: %d" % kt
        data = list(data_dict[kt])
        matplotlib.pyplot.bar(range(len(data)), data)
        if y != -1:
            matplotlib.pyplot.ylim(ymax=y)
        matplotlib.pyplot.show()

thread_bar_plots(LL_MR_data, thread_list, 0.2)


### L3 Top 10 bad miss rates

In [None]:
L3_data = select_metric_from_scaling(metric_data, 'DERIVED_L3_MISSRATE')
L3_MR_dict = {}
for n_thr in thread_list:
    L3_MR_dict[n_thr] = filter_libs_out(L3_data[n_thr]).sort_values(by='Exclusive',ascending=False)[["Exclusive"]]
print thread_list

In [None]:
THREAD_COUNT = 1
# func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L3_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L3_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(L3_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

### L3 Top 10 bad miss counts

In [None]:
# PAPI_NATIVE_LLC_MISSES is the Cori counter; the Talapas runs record PAPI_L3_TCM instead
L3_tcm = select_metric_from_scaling(metric_data, 'PAPI_NATIVE_LLC_MISSES')
L3_tcm_dict = {}
for n_thr in thread_list:
    L3_tcm_dict[n_thr] = filter_libs_out(L3_tcm[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 1
# func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(L3_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(L3_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(L3_tcm_dict[THREAD_COUNT], func=func, avg=True).head(20)

## Branch Missrate vs thread count
Similar to above, these cells show the rate of mispredicted branches.


<a href='#top'>top</a><br>
<a id='branch_mr'></a>

In [None]:
thread_list, BR_INS_data = scaling_plot(metric_data, plot=False, metric='PAPI_BR_INS')
thread_list, BR_MSP_data = scaling_plot(metric_data, plot=False, metric='PAPI_BR_MSP')
    
BR_MR_list = [BR_MSP_data[i] / BR_INS_data[i] for i in range(len(thread_list))]

# do not assign to `plt`, which would shadow the usual pyplot alias
line_handles = matplotlib.pyplot.plot(thread_list, BR_MR_list)

### Branch Miss rate by each thread of each thread count

In [None]:
thread_BR_INS_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_BR_INS')
thread_BR_MSP_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_BR_MSP')

MR_data = {}
for kt in thread_list:
#     print(thread_L1M_data[kt])
#     print(thread_L1A_data[kt])
    MR_data[kt] = thread_BR_MSP_data[kt] / thread_BR_INS_data[kt]
    
for kt in thread_list:
    print kt
    data = list(MR_data[kt])
    matplotlib.pyplot.bar(range(len(data)), data)
    matplotlib.pyplot.ylim(ymax=0.2)
    matplotlib.pyplot.show()

### Branch Top 10 bad miss rates

In [None]:
BR_data = select_metric_from_scaling(metric_data, 'DERIVED_BRANCH_MR')
BR_MR_dict = {}
for n_thr in thread_list:
    BR_MR_dict[n_thr] = filter_libs_out(BR_data[n_thr]).sort_values(by='Exclusive',ascending=False)[["Exclusive"]]
print thread_list

In [None]:
THREAD_COUNT = 1
func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(BR_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
func = 'NULL'
get_func_level_metric(BR_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(BR_MR_dict[THREAD_COUNT], func=func, avg=True).head(20)

### Branch Top 10 bad miss counts

In [None]:
# mispredicted branches; switch to PAPI_BR_INS to see total branch counts instead
BR_msp = select_metric_from_scaling(metric_data, 'PAPI_BR_MSP')
# BR_msp = select_metric_from_scaling(metric_data, 'PAPI_BR_INS')
BR_msp_dict = {}
for n_thr in thread_list:
    BR_msp_dict[n_thr] = filter_libs_out(BR_msp[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 1
func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(BR_msp_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(BR_msp_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
func = 'NULL'
# func = 'MultHelixPropTransp'
get_func_level_metric(BR_msp_dict[THREAD_COUNT], func=func, avg=True).head(20)

## Fetch Stall Top 10
This section covers fetch stalls per total cycle as well as the raw fetch stall counts.


<a href='#top'>top</a><br>
<a id='fetch_stall'></a>

In [None]:
thread_list, FETCH_STALL_data = scaling_plot(metric_data, plot=False, metric='PAPI_NATIVE_FETCH_STALL')
thread_list, TOT_CYC_data = scaling_plot(metric_data, plot=False, metric='PAPI_TOT_CYC')
    
FS_P_CYC_list = [FETCH_STALL_data[i] / TOT_CYC_data[i] for i in range(len(thread_list))]

# do not assign to `plt`, which would shadow the usual pyplot alias
line_handles = matplotlib.pyplot.plot(thread_list, FS_P_CYC_list)

### Fetch stalls per total cycle by each thread of each thread count

In [None]:
thread_TOT_CYC_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_TOT_CYC')
thread_FETCH_STL_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_NATIVE_FETCH_STALL')

FS_P_CYC_data = {}
for kt in thread_list:
#     print(thread_L1M_data[kt])
#     print(thread_L1A_data[kt])
    FS_P_CYC_data[kt] = thread_FETCH_STL_data[kt] / thread_TOT_CYC_data[kt]
    
for kt in thread_list:
    print kt
    data = list(FS_P_CYC_data[kt])
    matplotlib.pyplot.bar(range(len(data)), data)
    matplotlib.pyplot.ylim(ymax=0.2)
    matplotlib.pyplot.show()

### Fetch stalls per cycle Top 10 bad ratios

In [None]:
FSTC_data = select_metric_from_scaling(metric_data, 'DERIVED_RATIO_FETCH_STL_TOT_CYC')
FSTC_dict = {}
for n_thr in thread_list:
    FSTC_dict[n_thr] = filter_libs_out(FSTC_data[n_thr]).sort_values(by='Exclusive',ascending=False)[["Exclusive"]]
print thread_list

In [None]:
THREAD_COUNT = 1
func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(FSTC_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8

get_func_level_metric(FSTC_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(FSTC_dict[THREAD_COUNT], func=func, avg=True).head(20)

### Fetch Stall count top 10 lists

In [None]:
fetch_stall = select_metric_from_scaling(metric_data, 'PAPI_NATIVE_FETCH_STALL')
fetch_stall_dict = {}
for n_thr in thread_list:
    fetch_stall_dict[n_thr] = filter_libs_out(fetch_stall[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 1
func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(fetch_stall_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(fetch_stall_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(fetch_stall_dict[THREAD_COUNT], func=func, avg=True).head(20)

## Vector Ins per Ins Top 10 counts


<a href='#top'>top</a><br>
<a id='vipi'></a>

In [None]:
vector_ratio = select_metric_from_scaling(metric_data, 'DERIVED_VIPI')
vector_ratio_dict = {}
for n_thr in thread_list:
    vector_ratio_dict[n_thr] = filter_libs_out(vector_ratio[n_thr])
print thread_list

In [None]:
THREAD_COUNT = 1
func = 'NULL'
# func = 'FindCandidatesCloneEngine' # NULL for nothing
# func = 'SelectHitIndices' # NULL for nothing
get_func_level_metric(vector_ratio_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 8
get_func_level_metric(vector_ratio_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
THREAD_COUNT = 48
get_func_level_metric(vector_ratio_dict[THREAD_COUNT], func=func, avg=True).head(20)

In [None]:
def combine_metrics_2(metric_dict,inc_exc='Inclusive'):
    if inc_exc == 'Inclusive': todrop = 'Exclusive'
    else: todrop = 'Inclusive'
    
    for m in metric_dict:
        if not m == 'METADATA':
            print metric_dict[m].index
            metric_dict[m].index = metric_dict[m].index.droplevel()
            print metric_dict[m].index
    
    alldata = metric_dict['PAPI_TOT_CYC'].copy().drop(['Calls','Subcalls',todrop,'ProfileCalls'], axis=1)
    alldata['PAPI_TOT_CYC'] = alldata[inc_exc]
    alldata.drop([inc_exc],axis=1,inplace=True)

    for x in metric_dict.keys():
        if x in ['PAPI_TOT_CYC','METADATA']: continue
        alldata[x] = metric_dict[x][inc_exc]
    return alldata
                
data = dict(metric_data)
THREAD_COUNT = 8
metric = 'DERIVED_VIPI'


data[THREAD_COUNT]['PAPI_TOT_CYC'].sort_values(by='Inclusive',ascending=False).head(10)


# alldata = combine_metrics_2(data[THREAD_COUNT],inc_exc='Exclusive')


# for m in metric_data[THREAD_COUNT].keys():
#     if (not m == 'METADATA') and (not m == 'PAPI_TOT_CYC') and (not m == metric): alldata.drop([m],axis=1,inplace=True)

# metric_data[THREAD_COUNT][metric]
# metric_data[THREAD_COUNT]['PAPI_TOT_CYC']
# alldata.sort_values(by='DERIVED_VIPI',ascending=False)
# cyc_data = select_metric_from_scaling(metric_data, 'PAPI_TOT_CYC')

In [None]:
t,d = scaling_plot(metric_data, inclusive=True, plot=True, function="MultHelixPropTransp", metric='DERIVED_L1_MISSRATE', max=False)
# d is ordered like t, so look up the 8-thread entry by its position in t
print d[t.index(8)]
t,d = scaling_plot(metric_data, inclusive=True, plot=True, function="MultHelixPropTransp", metric='DERIVED_L1_MISSRATE', max=True)
print d[t.index(8)]

### Resource Stalls vs thread count
Similar to above these cells show the Resource Stalls. In this case we have nothing to compute, so we simply call the function. Future work includes exploring the different types of stalls.

In [None]:
thread_list, res_stall_data = scaling_plot(metric_data, metric='PAPI_RES_STL')

In [None]:
thread_stall_data = get_thread_level_metric_scaling(metric_data, metric='PAPI_RES_STL')
thread_bar_plots(thread_stall_data, thread_list, 4000000000)

In [None]:
alldata = combine_metrics(metric_data[32],'Inclusive')
cm = sns.light_palette("yellow", as_cmap=True)
correlations_pearson = alldata.corr('pearson').fillna(0)    # Other methods: 'kendall', 'spearman'
correlations_pearson.style.format("{:.2%}").background_gradient(cmap=cm)