# Performance Analysis of HEP codes (examples)

Authors: Brian Gravelle (gravelle@cs.uoregon.edu), Boyana Norris (norris@cs.uoregon.edu)


## 1. Prerequisites

These examples are based on the TAU and taucmdr tools from ParaTools
(http://taucommander.paratools.com/). Note that the analysis functionality is new and not yet publicly available.

In [3]:
# A couple of scripts to set the environment and import data from a .tau set of results
from utilities import *
from metrics import *
# Plotting, notebook settings:
%matplotlib inline  
#plt.rcParams.update({'font.size': 16})
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## 2. Performance data 

TAU Commander uses TAU to run the application and measure it using runtime sampling techniques (similar to Intel VTune). Many customization options are available. For example, we may consider each function regardless of calling context, or we may decide to enable callpath profiling to see each context separately.
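
To make the difference between the two views concrete, here is a minimal sketch in plain Python. The call stacks, the function names, and the `" => "` separator are made up for illustration and are not TAU output; the point is only that a flat profile keys each sample by the executing function, while a callpath profile keys it by the whole calling context.

```python
from collections import Counter

# Hypothetical sampled call stacks (outermost caller first); not real TAU data.
samples = [
    ("main", "fit_tracks", "propagate"),
    ("main", "fit_tracks", "propagate"),
    ("main", "build_seeds", "propagate"),
    ("main", "build_seeds", "sort_hits"),
]

# Flat profile: attribute each sample to the function that was executing,
# regardless of which function called it.
flat = Counter(stack[-1] for stack in samples)

# Callpath profile: keep the full calling context as the key, so the same
# function reached through different callers is counted separately.
callpath = Counter(" => ".join(stack) for stack in samples)

print(flat)
print(callpath)
```

In the flat view `propagate` collects all three of its samples in one entry; in the callpath view they are split between the `fit_tracks` and `build_seeds` contexts.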

Available experiments:
* multi - data based on a toy version of the program.
* realistic  - data based on a run of the program with the TT35PU... input file, 10 threads, and 100 events
* event_scaling_TT35  - data based on a run of the program with the TT35PU... input file, 10 threads, and events ranging from 10 to 100
* TT70  - data based on a run of the program with the TT70PU... input file, 10 threads, and 100 events
* event_scaling_TT70  - data based on a run of the program with the TT70PU... input file, 10 threads, and events ranging from 10 to 100
* Note: the event scaling runs do not yet have a function to load their data properly.


### Simple explorations

In [4]:
expr_intervals = load_perf_data(application="mictest_sampling",experiment="realistic")

#level_inds = {'trial': 0, 'rank': 1, 'context': 2, 'thread': 3, 'region': 4}
print(expr_intervals.keys())
print("")
print(expr_intervals['PAPI_TOT_INS'].columns)
print("")
print(expr_intervals['PAPI_TOT_INS'].index.names)
print("")
print_metadata(expr_intervals)

expr_intervals['PAPI_TOT_INS'].sort_values(by='Inclusive',ascending=False)[["Inclusive"]].head(10)

NameError: name 'load_perf_data' is not defined
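
The cell above treats the loaded data as a dictionary keyed by metric name, where each value is a pandas DataFrame indexed by levels such as context, thread, and region. The sketch below builds a tiny stand-in with made-up numbers to show that layout and the same sort-and-head query. It does not call `load_perf_data`; the name `expr_intervals_demo` and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: (context, thread, region). Values are invented.
index = pd.MultiIndex.from_tuples(
    [
        (0, 0, "[SUMMARY] syscall"),
        (0, 0, "[SUMMARY] Event::clean_cms_seedtracks()"),
        (0, 1, "[SUMMARY] syscall"),
    ],
    names=["context", "thread", "region"],
)
tot_ins = pd.DataFrame(
    {"Exclusive": [40.0, 90.0, 10.0], "Inclusive": [50.0, 120.0, 10.0]},
    index=index,
)

# One DataFrame per metric, keyed by metric name.
expr_intervals_demo = {"PAPI_TOT_INS": tot_ins}

# Same query as in the cell above: rows with the largest inclusive values.
top = (
    expr_intervals_demo["PAPI_TOT_INS"]
    .sort_values(by="Inclusive", ascending=False)[["Inclusive"]]
    .head(2)
)
print(top)
```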

### Importing Scaling Data
No scaling data is available yet, so do not use this section.

In [3]:
# scale_intervals = get_pandas_scaling('.tau/mictest_sampling/scaling/')

#level_inds = {'trial': 0, 'rank': 1, 'context': 2, 'thread': 3, 'region': 4}
# print(scale_intervals.keys())

# print(scale_intervals[10].keys())
# print(scale_intervals[10]['PAPI_TOT_INS'].columns)
# print(scale_intervals[10]['PAPI_TOT_INS'].index.names)

# scale_intervals[10]['PAPI_TOT_INS'][["Inclusive"]].head(10)

## Adding metrics

TODO figure this out
  - scale function
  - separate threads

These functions add derived metrics to the dictionary.

In [4]:
add_CPI(expr_intervals)
print(expr_intervals.keys())

expr_intervals['DERIVED_CPI'].head(10)

['PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD', 'PAPI_L2_TCA', 'PAPI_NATIVE_LLC_MISSES', 'PAPI_RES_STL', 'PAPI_L2_TCM', 'PAPI_TOT_INS', 'PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD', 'DERIVED_CPI', 'PAPI_LST_INS', 'PAPI_NATIVE_LLC_REFERENCES', 'PAPI_L1_TCM', 'PAPI_TOT_CYC', 'METADATA']


context,thread,region,Calls,Exclusive,Inclusive,ProfileCalls,Subcalls
0,0,"[SUMMARY] (anonymous namespace)::MultHelixProp(Matriplex::Matriplex<float, 6, 6, 8> const&, Matriplex::MatriplexSym<float, 6, 8> const&, Matriplex::Matriplex<float, 6, 6, 8>&)",1.0,1.530762,1.530762,,
0,0,"[SUMMARY] (anonymous namespace)::MultHelixPropEndcap(Matriplex::Matriplex<float, 6, 6, 8> const&, Matriplex::MatriplexSym<float, 6, 8> const&, Matriplex::Matriplex<float, 6, 6, 8>&)",1.0,2.295919,2.295919,,
0,0,"[SUMMARY] (anonymous namespace)::MultHelixPropTranspEndcap(Matriplex::Matriplex<float, 6, 6, 8> const&, Matriplex::Matriplex<float, 6, 6, 8> const&, Matriplex::MatriplexSym<float, 6, 8>&)",2.0,2.746569,2.746569,,
0,0,[SUMMARY] Event::clean_cms_seedtracks(),0.862069,1.41007,1.41007,,
0,0,"[SUMMARY] FMA(__m256 const&, __m256 const&, __m256 const&)",3.0,9.215263,9.215263,,
0,0,[SUMMARY] Hit::Hit(),1.0,1.553591,1.553591,,
0,0,[SUMMARY] HitOnTrack::HitOnTrack(),1.5,2.085337,2.085337,,
0,0,"[SUMMARY] LayerOfHits::SuckInHits(std::vector<Hit, std::allocator<Hit> > const&)",0.5,0.787866,0.787866,,
0,0,"[SUMMARY] Matriplex::Matriplex<int, 1, 1, 8>::operator()(int, int, int)",1.0,1.863106,1.863106,,
0,0,"[SUMMARY] Matriplex::MatriplexSym<float, 3, 8>::SlurpIn(char const*, int*, int)",1.0,1.481942,1.481942,,
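
For readers without access to `metrics.py`, the following is a minimal sketch of what a CPI-adding function could look like, assuming CPI is cycles per instruction (`PAPI_TOT_CYC` divided by `PAPI_TOT_INS`) and that both entries are DataFrames with matching index and columns. The name `add_cpi_sketch` and the demo values are hypothetical; the actual `add_CPI` may differ.

```python
import pandas as pd

def add_cpi_sketch(metrics):
    """Add cycles-per-instruction as metrics['DERIVED_CPI'].

    Returns True on success, False if a required counter is missing.
    """
    if "PAPI_TOT_CYC" not in metrics or "PAPI_TOT_INS" not in metrics:
        return False
    cycles = metrics["PAPI_TOT_CYC"]
    instructions = metrics["PAPI_TOT_INS"]
    # Element-wise division; pandas aligns rows and columns by label.
    metrics["DERIVED_CPI"] = cycles / instructions
    return True

# Invented numbers for two regions.
index = pd.Index(["region_a", "region_b"], name="region")
demo = {
    "PAPI_TOT_CYC": pd.DataFrame(
        {"Exclusive": [300.0, 100.0], "Inclusive": [600.0, 100.0]}, index=index
    ),
    "PAPI_TOT_INS": pd.DataFrame(
        {"Exclusive": [100.0, 200.0], "Inclusive": [400.0, 200.0]}, index=index
    ),
}
ok = add_cpi_sketch(demo)
print(demo["DERIVED_CPI"])
```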


### Metric Generation

`gen_metric` generates the boilerplate for a metric-adding function. To use it:
* list the metrics you will use
* provide a name for the new metric
* paste the generated code into metrics.py
* implement the math for the new metric

In [5]:
print(gen_metric(['PAPI_NATIVE_UOPS_RETIRED_PACKED_SIMD', 'PAPI_L1_TCM'], "VECTOR_PER_MISS"))

def add_VECTOR_PER_MISS(metrics):
	if (not metrics.has_key(PAPI_NATIVE_UOPS_RETIRED_PACKED_SIMD)):
		print 'ERROR adding VECTOR_PER_MISS to metric dictionary'
		return False
	a0 = metrics[PAPI_NATIVE_UOPS_RETIRED_PACKED_SIMD].copy()
	a0.index = a0.index.droplevel()
	u0 = a0.unstack()
	if (not metrics.has_key(PAPI_L1_TCM)):
		print 'ERROR adding VECTOR_PER_MISS to metric dictionary'
		return False
	a1 = metrics[PAPI_L1_TCM].copy()
	a1.index = a1.index.droplevel()
	u1 = a1.unstack()
	metrics[VECTOR_PER_MISS] = "PLEASE IMPLEMENT THIS PART"

	return True
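
As an illustration of the last step, here is one way the generated template might look once the math is filled in, written for Python 3. This is a sketch under assumptions: the dictionary keys are quoted strings (using the colon form of the SIMD counter name that appears in the key listing earlier), the metric is defined as packed SIMD uops divided by L1 misses, and the demo data is invented. The function name `add_vector_per_miss_sketch` is hypothetical.

```python
import pandas as pd

def add_vector_per_miss_sketch(metrics):
    needed = ["PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD", "PAPI_L1_TCM"]
    for key in needed:
        if key not in metrics:
            print("ERROR adding VECTOR_PER_MISS to metric dictionary")
            return False
    unstacked = []
    for key in needed:
        frame = metrics[key].copy()
        # Same steps as the generated template: drop the outermost index
        # level, then pivot the innermost level (region) into columns.
        frame.index = frame.index.droplevel()
        unstacked.append(frame.unstack())
    # The math: packed SIMD uops per L1 cache miss.
    metrics["VECTOR_PER_MISS"] = unstacked[0] / unstacked[1]
    return True

# Invented data: one context, two threads, two regions.
index = pd.MultiIndex.from_tuples(
    [(0, 0, "a"), (0, 0, "b"), (0, 1, "a"), (0, 1, "b")],
    names=["context", "thread", "region"],
)
demo = {
    "PAPI_NATIVE_UOPS_RETIRED:PACKED_SIMD": pd.DataFrame(
        {"Inclusive": [80.0, 10.0, 40.0, 30.0]}, index=index
    ),
    "PAPI_L1_TCM": pd.DataFrame(
        {"Inclusive": [20.0, 5.0, 10.0, 15.0]}, index=index
    ),
}
ok = add_vector_per_miss_sketch(demo)
result = demo["VECTOR_PER_MISS"]
print(result)
```

Because of the `unstack()` step, the result here is a wide table with one row per thread and one column per (value column, region) pair.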





## Interesting bits

This section runs the hotspot analysis on the raw and derived metrics.

In [6]:
# levels: 0=trial, 1=node, 2=context, 3=thread, 4=region name -- deprecated
# levels: 0=rank, 1=context, 2=thread, 3=region name -- deprecated

       
n=10

def get_hotspots(metric):
    print('selected metric: %s\n' %metric)
    hotspots(expr_intervals[metric], n, 1)

    print('='*80)

    filtered_dfs = filter_libs_out(expr_intervals[metric])
    hotspots(filtered_dfs, n, 1)
    
get_hotspots('PAPI_TOT_CYC')



selected metric: PAPI_TOT_CYC

Hotspot Analysis Summary
The code regions with largest inclusive time are: 
1: [SUMMARY] syscall  (2644723369)
2: [SUMMARY] std::__detail::_Mod_range_hashing::operator()(unsigned long, unsigned long) const  (1778095163)
3: [SUMMARY] _int_free  (1231159555)
4: [SUMMARY] _int_malloc  (1219269406)
5: [SUMMARY] __GI___libc_malloc  (1202496560)
6: [SUMMARY] Event::clean_cms_seedtracks()  (1112271578)
7: [SUMMARY] UNRESOLVED /storage/packages/intel/vtune_amplifier_xe_2017.5.0.526192/lib64/libstdc++.so.6.0.20 (881792500)
8: [SUMMARY] std::_Hashtable<int, std::pair<int const, int>, std::allocator<std::pair<int const, int> >, std::__detail::_Select1st, std::equal_to<int>, std::hash<int>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true> >::_M_insert_unique_node(unsigned long, unsigned long, std::__detail::_Hash_node<std::pair<int const, int>, false>*)  (
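
The helpers `hotspots` and `filter_libs_out` come from the imported utility scripts and are not shown here. The sketch below is a guess at their general shape, assuming a hotspot ranking sums the `Inclusive` column per region and that library filtering drops regions whose names start with a set of runtime or standard-library prefixes. The function names, the prefix list, and the data are all hypothetical.

```python
import pandas as pd

def hotspots_sketch(frame, n, column="Inclusive"):
    """Return the n regions with the largest summed value in `column`."""
    totals = frame.groupby(level="region")[column].sum()
    return totals.sort_values(ascending=False).head(n)

def filter_libs_out_sketch(frame, prefixes=("_int_", "__GI_", "std::", "syscall")):
    """Drop rows whose region name starts with a library-like prefix.

    The '[SUMMARY] ' tag is stripped before the prefix test.
    """
    regions = frame.index.get_level_values("region")
    keep = [
        not name.replace("[SUMMARY] ", "").startswith(prefixes)
        for name in regions
    ]
    return frame.loc[keep]

# Invented cycle counts for a few regions on two threads.
index = pd.MultiIndex.from_tuples(
    [
        (0, 0, "[SUMMARY] syscall"),
        (0, 0, "[SUMMARY] _int_malloc"),
        (0, 0, "[SUMMARY] Event::clean_cms_seedtracks()"),
        (0, 1, "[SUMMARY] Event::clean_cms_seedtracks()"),
        (0, 1, "[SUMMARY] Hit::Hit()"),
    ],
    names=["context", "thread", "region"],
)
demo = pd.DataFrame({"Inclusive": [500.0, 300.0, 200.0, 150.0, 50.0]}, index=index)

all_top = hotspots_sketch(demo, 2)
app_top = hotspots_sketch(filter_libs_out_sketch(demo), 2)
print(all_top)
print(app_top)
```

As in the output above, the unfiltered ranking is dominated by system and allocator routines, while the filtered ranking surfaces application code.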

In [7]:
get_hotspots('DERIVED_CPI')

selected metric: DERIVED_CPI

Hotspot Analysis Summary
The code regions with largest inclusive time are: 
1: [SUMMARY] __GI___sched_yield  (1383.3555433316772)
2: [SUMMARY] Matriplex::Matriplex<float, 6, 1, 8>::CopyOut(int, float*) const  (337.8687114599183)
3: [SUMMARY] FMA(__m256 const&, __m256 const&, __m256 const&)  (143.3827012750045)
4: [SUMMARY] helixAtRFromIterativeCCS(Matriplex::Matriplex<float, 6, 1, 8> const&, Matriplex::Matriplex<int, 1, 1, 8> const&, Matriplex::Matriplex<float, 6, 1, 8>&, Matriplex::Matriplex<float, 1, 1, 8> const&, Matriplex::Matriplex<float, 6, 6, 8>&, int, bool)  (140.12204158515513)
5: [SUMMARY] (anonymous namespace)::MultHelixPropEndcap(Matriplex::Matriplex<float, 6, 6, 8> const&, Matriplex::MatriplexSym<float, 6, 8> const&, Matriplex::Matriplex<float, 6, 6, 8>&)  (33.74236172284297)
6: [SUMMARY] Matriplex::MatriplexSym<float, 6, 8>::CopyIn(int, float const*)  (30.655652580671333)
7: [SUMMARY] _INTERNAL_27_______src_tbb_scheduler_cpp_dbc24cd9::__TBB_m

In [17]:
add_L1_missrate(expr_intervals)
get_hotspots('DERIVED_L1_MISSRATE')

selected metric: DERIVED_L1_MISSRATE

Hotspot Analysis Summary
The code regions with largest standard deviation are: 
1: [SUMMARY] void _INTERNAL1f3c31c2::helixAtRFromIterativeCCS_impl<Matriplex::Matriplex<float, 6, 1, 16>, Matriplex::Matriplex<int, 1, 1, 16>, Matriplex::Matriplex<float, 6, 1, 16>, Matriplex::Matriplex<float, 1, 1, 16>, Matriplex::Matriplex<float, 6, 6, 16> >(Matriplex::Matriplex<float, 6, 1, 16> const&, Matriplex::Matriplex<int, 1, 1, 16> const&, Matriplex::Matriplex<float, 6, 1, 16>&, Matriplex::Matriplex<float, 1, 1, 16> const&, Matriplex::Matriplex<float, 6, 6, 16>&, int, int, int, bool)  (0.224730558282)
2: [SUMMARY] _INTERNAL_27_______src_tbb_scheduler_cpp_dbc24cd9::__TBB_machine_pause(int)  (0.0612775331484)
3: [SUMMARY] __intel_mic_avx512f_memset  (0.0539007582363)
4: [SUMMARY] Matriplex::MatriplexSym<float, 3, 16>::SlurpIn(char const*, __m512i&)  (0.0532490464803)
5: [SUMMARY] __intel_mic_avx512f_memcpy  (0.0464728354125)
6: [SUMMARY] Matriplex::MatriplexSym<f
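
The definition used by `add_L1_missrate` is not shown in this notebook. One plausible definition, given the counters listed earlier, is L1 total cache misses divided by load/store instructions (`PAPI_L1_TCM / PAPI_LST_INS`). The sketch below implements that assumed ratio and masks zero denominators so that regions with no recorded loads or stores yield NaN instead of infinity. The function name and the data are hypothetical.

```python
import pandas as pd

def add_l1_missrate_sketch(metrics):
    """Add an assumed L1 miss rate as metrics['DERIVED_L1_MISSRATE']."""
    if "PAPI_L1_TCM" not in metrics or "PAPI_LST_INS" not in metrics:
        return False
    misses = metrics["PAPI_L1_TCM"]
    # Replace zero access counts with NaN so the ratio is undefined
    # rather than infinite for regions with no loads or stores.
    accesses = metrics["PAPI_LST_INS"].replace(0, float("nan"))
    metrics["DERIVED_L1_MISSRATE"] = misses / accesses
    return True

# Invented counts for two regions.
index = pd.Index(["region_a", "region_b"], name="region")
demo = {
    "PAPI_L1_TCM": pd.DataFrame({"Inclusive": [5.0, 2.0]}, index=index),
    "PAPI_LST_INS": pd.DataFrame({"Inclusive": [50.0, 0.0]}, index=index),
}
ok = add_l1_missrate_sketch(demo)
missrate = demo["DERIVED_L1_MISSRATE"]
print(missrate)
```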

In [18]:
get_hotspots('PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD')

selected metric: PAPI_NATIVE_UOPS_RETIRED:SCALAR_SIMD

Hotspot Analysis Summary
The code regions with largest standard deviation are: 
1: [SUMMARY] __read_nocancel  (336101904)
2: [SUMMARY] Event::resetLayerHitMap(bool)  (5005640)
3: [SUMMARY] make_validation_tree(char const*, std::vector<Track, std::allocator<Track> >&, std::vector<Track, std::allocator<Track> >&)  (4877635)
4: [SUMMARY] void std::__uninitialized_default_n_1<false>::__uninit_default_n<Hit*, unsigned long>(Hit*, unsigned long)  (4655578)
5: [SUMMARY] std::vector<Hit, std::allocator<Hit> >::size() const  (4235558)
6: [SUMMARY] ROOT::Math::SVector<float, 3u>::SVector()  (2695292)
7: [SUMMARY] std::vector<HitID, std::allocator<HitID> >::size() const  (1925263)
8: [SUMMARY] _IO_file_xsgetn  (1925188)
9: [SUMMARY] Matriplex::MatriplexSym<float, 3, 16>::SlurpIn(char const*, __m512i&)  (1540212)
10: [SUMMARY] ROOT::Math::SMatrix<float, 6u, 6u, ROOT::Math::MatRepSym<float, 6u> >::SMatrix()  (1540149)
Hotspot Analysis Summary
T