# Aggregate Instruction Statistics

This notebook combines and extends the content presented in [Opcode-Instanced Metrics](Opcode_instanced_metrics.ipynb) and [Source-Correlated Metrics](Source_correlated_metrics.ipynb). You will learn how to:

* Inspect instruction-level SASS metrics
* Correlate SASS addresses and CUDA-C lines
* Aggregate instruction statistics not readily available from the existing UI

## Setup

First, import NVIDIA Nsight Compute's Python Report Interface (PRI) as `ncu_report`
and load an `ncu-rep` report file with `load_report`:

In [None]:
import ncu_report

report_file_path = "../sample_reports/mergeSort.ncu-rep"
report = ncu_report.load_report(report_file_path)

Now that we have our report loaded, it's time to aggregate the data we are interested in. The next code cell implements the following algorithm:

* Iterate over all executed SASS instructions (opcodes and values are given as examples and may differ in your report)
```
Address         Source                             inst_executed
0x7f3db4452960  MUFU.RCP64H R5, 9                     64320
...             ...                                    -
0x7f3db4452990  IMAD.MOV.U32 R20, RZ, RZ, 0x0         64320
0x7f3db44529a0  IMAD.MOV.U32 R21, RZ, RZ, 0x40220000  64320
...             ...                                    -
0x7f3db44529c0  IMAD.MOV.U32 R4, RZ, RZ, 0x1          64320
...             ...                                    -
0x7f3db44529f0  UMOV UR4, URZ                         64320
0x7f3db4452a00  BSSY B0, 0x7f3db4452cd0               64320
...             ...                                    -
```

* Extract opcode part from disassembly
```
MUFU.RCP64H R5, 9                    -> MUFU.RCP64H
IMAD.MOV.U32 R20, RZ, RZ, 0x0        -> IMAD.MOV.U32
IMAD.MOV.U32 R21, RZ, RZ, 0x40220000 -> IMAD.MOV.U32
```

* Correlate SASS with CUDA-C and aggregate values
```
Line 101:
  MUFU.RCP64H  : 64320
  IMAD.MOV.U32 : 192960
  UMOV         : 64320
  BSSY         : 64320
```

In [None]:
import re

def aggregate_instructions(inst_executed):
    # get the number of instances of the metric (one value per SASS instruction)
    num_instances = inst_executed.num_instances()
    # get program counters (addresses within the function)
    pcs = inst_executed.correlation_ids()

    # regex pattern to extract the SASS opcode part we want
    opcode_pattern = re.compile(r"\s*(?:@\!?P\d\s+)?([\w\.]+)\s+.*")

    # iterate over all executed instructions
    inst_locations = dict()
    for inst in range(0, num_instances):
        pc = pcs.as_uint64(inst)
        value = inst_executed.as_uint64(inst)

        # aggregate the SASS-level metric to CUDA-C using
        # lineinfo correlation from the compiler
        # (`kernel` is the global assigned below, before this function is called)
        source = kernel.source_info(pc)
        if value > 0 and source:
            file = source.file_name()
            line = source.line()

            # get the SASS at this address and extract the opcode
            # track the number of times the opcode was executed
            sass = kernel.sass_by_pc(pc)
            m = opcode_pattern.match(sass)
            if m:
                opcode = m.group(1)
                per_line = inst_locations.setdefault(file, dict()).setdefault(line, dict())
                per_line[opcode] = per_line.get(opcode, 0) + value

    return inst_locations


kernel = report[0][0]
print(f"Kernel: {kernel.name()}")

# get the executed warp-level instructions metric for this kernel
# and aggregate statistics
metric = kernel["inst_executed"]
stats = aggregate_instructions(metric)
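
The opcode extraction and aggregation steps can also be tried in isolation, without a report file. The following is a minimal sketch that reuses the regex from the cell above on the example rows listed earlier. The `rows` list and the assumption that all rows map to the same CUDA-C line are illustrative, and the predicated `@!P0 BRA` instruction is a hypothetical input used only to show how the predicate prefix is skipped.

```python
import re
from collections import Counter

# Same pattern as in the cell above, written as a raw string.
opcode_pattern = re.compile(r"\s*(?:@\!?P\d\s+)?([\w\.]+)\s+.*")

# Example rows (SASS text, inst_executed) taken from the listing above.
# Assumption: all of them correlate to the same CUDA-C line.
rows = [
    ("MUFU.RCP64H R5, 9", 64320),
    ("IMAD.MOV.U32 R20, RZ, RZ, 0x0", 64320),
    ("IMAD.MOV.U32 R21, RZ, RZ, 0x40220000", 64320),
    ("IMAD.MOV.U32 R4, RZ, RZ, 0x1", 64320),
    ("UMOV UR4, URZ", 64320),
    ("BSSY B0, 0x7f3db4452cd0", 64320),
]

# Sum the executed-instruction counts per opcode.
totals = Counter()
for sass, value in rows:
    m = opcode_pattern.match(sass)
    if m:
        totals[m.group(1)] += value

for opcode, value in totals.items():
    print(f"{opcode:13}: {value}")

# Hypothetical predicated instruction: the "@!P0 " prefix is not part of the opcode.
predicated_opcode = opcode_pattern.match("@!P0 BRA 0x7f3db4452cd0").group(1)
print(predicated_opcode)
```

Note that the operands are dropped entirely, so `UMOV UR4, URZ` is counted under `UMOV`.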


Finally, for each CUDA-C line we sort the opcodes by execution count in descending order and print everything:
```
Line 101:
  IMAD.MOV.U32 : 193k
  BSSY         : 64k
  MUFU.RCP64H  : 64k
  UMOV         : 64k
```

In [None]:
for file in stats:
    print(f"Line| File: {file}")
    for line in stats[file]:
        print(f"{line:-4}")
        by_count = sorted(stats[file][line].items(), key=lambda item: item[1], reverse=True)
        for opcode, value in by_count:
            print(f"    | {opcode:15}:{(value/1000):-6.1f}k")
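
The same nested dictionary can be collapsed further, for example into per-opcode totals across all source lines. The sketch below assumes the `{file: {line: {opcode: count}}}` layout produced by `aggregate_instructions`. `sample_stats`, its file name, and its values are hypothetical stand-ins so that the cell runs without a report; with a loaded report you would pass `stats` instead.

```python
from collections import Counter

# Hypothetical stand-in with the same layout as `stats`:
# {file: {line: {opcode: executed count}}}
sample_stats = {
    "example.cu": {
        101: {"MUFU.RCP64H": 64320, "IMAD.MOV.U32": 192960},
        102: {"IMAD.MOV.U32": 1000, "BSSY": 64320},
    }
}

# Add up the counts of each opcode over all files and lines.
opcode_totals = Counter()
for lines in sample_stats.values():
    for opcodes in lines.values():
        opcode_totals.update(opcodes)

# Print the most frequently executed opcodes first.
for opcode, value in opcode_totals.most_common(3):
    print(f"{opcode:15}:{value / 1000:6.1f}k")
```

`Counter.update` adds counts rather than replacing them, which is why an opcode that appears on several lines ends up with a single summed entry.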