# MICRO 2024 FuseMax Artifact Evaluation Figures 6-12

This notebook performs the modeling and generates the figures used in the paper, "FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design", which appeared in MICRO 2024.

**Warning: Running a cell more than once will overwrite the previous output produced by the cell. Please back up all results as soon as they are produced in case you accidentally rerun the cell.** All results are stored in `workspace/outputs/generated/`.

In [None]:
%matplotlib inline

import os
import sys
sys.path.insert(0, "..")

import src.scripts.check as check
import src.scripts.run as run
import src.utils.graph as graph

Set `pregenerated` to `True` to draw the figures from the pre-generated results, or to `False` to draw them from the results you generate in this notebook.

In [None]:
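from IPython.display import display
from ipywidgets import interactive

# Default: draw the figures from results generated in this notebook.
pregenerated = False

def set_pregenerated(**kwargs):
    # Widget callback: store the dropdown selection in the module-level flag.
    global pregenerated
    pregenerated = kwargs['pregenerated']

# A list of options makes `interactive` render a dropdown.
options = {"pregenerated": [False, True]}

w = interactive(set_pregenerated, **options)

display(w)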

## Collect Data

Run the Timeloop/Accelergy models of the baselines and various configurations of FuseMax. To skip these cells, set `pregenerated` to `True` above and draw the figures from the pre-generated results instead.

Expected Run Time: 9 hours

**Warning: Running a cell more than once will overwrite the previous output produced by the cell. Please back up all results as soon as they are produced in case you accidentally rerun the cell.** All results are stored in `workspace/outputs/generated/`.
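
One way to follow the backup advice in the warning above is to copy the output directory into a timestamped folder after each run. The sketch below is only an illustration: the helper `backup_outputs` and the `workspace/outputs/backups` destination are hypothetical names that are not part of the artifact, and the paths are assumed to be relative to the current working directory.

```python
import shutil
import time
from pathlib import Path


def backup_outputs(src="workspace/outputs/generated",
                   dst_root="workspace/outputs/backups"):
    """Copy `src` into a new timestamped directory under `dst_root`.

    Both default paths are assumptions about where the notebook runs;
    `dst_root` is a hypothetical location, not one the artifact defines.
    """
    src = Path(src)
    if not src.is_dir():
        raise FileNotFoundError(f"No outputs found at {src}")
    # A timestamped name keeps successive backups from overwriting each other.
    dst = Path(dst_root) / f"generated-{time.strftime('%Y%m%d-%H%M%S')}"
    # copytree creates `dst` (and missing parents) and fails if it already exists.
    shutil.copytree(src, dst)
    return dst
```

Calling `backup_outputs()` in a new cell after each long-running cell would leave one snapshot per call. Two calls within the same second would collide on the directory name and raise an error rather than overwrite the earlier backup.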

### Model Attention

The following cells model the five configurations of attention: two baselines—the unfused baseline and [FLAT](https://dl.acm.org/doi/10.1145/3575693.3575747)—and three configurations of FuseMax—adding the cascade, architecture, and binding. These cells are required to generate Figures 6-11.

#### Model the Unfused Baseline

In [None]:
run.attn("unfused")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("attn-unfused")

#### Model FLAT

In [None]:
run.attn("flat")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("attn-flat")

#### Model `+Cascade`

Model an attention accelerator that uses the FuseMax cascade on the FLAT architecture.

In [None]:
run.attn("cascade")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("attn-cascade")

#### Model `+Architecture`

Model an attention accelerator that uses the FuseMax cascade and architecture, but evaluates one tile at a time (instead of FuseMax's interleaved binding).

In [None]:
run.attn("arch")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("attn-arch")

#### Model `+Binding`

Model the full FuseMax attention accelerator.

In [None]:
run.attn("binding")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("attn-binding")

### Model End-to-End Transformer Inference

The following cells model the linear layers—Q, K, and V projection, deprojection, and the FFN layers—on both the [FLAT](https://dl.acm.org/doi/10.1145/3575693.3575747) and FuseMax architectures. These cells are required to generate Figures 10 and 11.

#### Model Linear Layers on FLAT Architecture

In [None]:
run.end2end("flat")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("end2end-flat")

#### Model Linear Layers on FuseMax Architecture

In [None]:
run.end2end("proposal")

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("end2end-proposal")

### Sweep Architecture Area

The following cell sweeps accelerator architecture design points for attention to build a Pareto-optimal curve. This cell is required to generate Figure 12.

In [None]:
run.pareto()

Check that the outputs match the results generated by the authors:

In [None]:
check.outputs("pareto")
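
For readers unfamiliar with the term, a Pareto-optimal curve keeps only the design points that no other point beats on every axis at once. The sketch below shows the idea for two axes where lower is better on both. It is not the implementation behind `run.pareto()`; the function name `pareto_front`, the choice of (area, latency) as the axes, and the sample numbers are all assumptions made for illustration.

```python
def pareto_front(points):
    """Return the non-dominated (area, latency) points, sorted by area.

    Assumes lower is better on both axes. A point is dropped when another
    point has area <= its area and latency <= its latency.
    """
    front = []
    best_latency = float("inf")
    # Sorting by area (then latency) means every earlier point is at least
    # as small, so a point survives only if it strictly improves latency.
    for area, latency in sorted(points):
        if latency < best_latency:
            front.append((area, latency))
            best_latency = latency
    return front


# Made-up design points for illustration only, not results from the paper.
designs = [(1.0, 10.0), (2.0, 12.0), (2.0, 6.0), (3.0, 6.0), (4.0, 3.0)]
print(pareto_front(designs))  # [(1.0, 10.0), (2.0, 6.0), (4.0, 3.0)]
```

Here `(2.0, 12.0)` is dominated by `(2.0, 6.0)`, and `(3.0, 6.0)` costs more area for no latency gain, so both are dropped.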

## Draw Figures

Note: All figures are also saved to `workspace/outputs/generated/figs` and can be compared with `workspace/outputs/pregenerated/figs`.
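
As a quick first check before comparing figures by eye, the file names in the two directories can be compared. This is a minimal sketch under the assumption that both paths are relative to the current working directory; `compare_figure_names` is a hypothetical helper, and it compares names only, not image contents.

```python
from pathlib import Path


def compare_figure_names(generated="workspace/outputs/generated/figs",
                         pregenerated="workspace/outputs/pregenerated/figs"):
    """Return (missing, extra) file names relative to the pre-generated set.

    `missing` lists files present only in `pregenerated`; `extra` lists
    files present only in `generated`. File contents are not inspected.
    """
    gen = {p.name for p in Path(generated).iterdir() if p.is_file()}
    pre = {p.name for p in Path(pregenerated).iterdir() if p.is_file()}
    return sorted(pre - gen), sorted(gen - pre)
```

An empty `missing` list suggests that every pre-generated figure has a generated counterpart; the figures themselves still need to be compared visually.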

### Figure 6a

In [None]:
graph.draw_bar_graph(graph.load_data("util_1d", pregenerated=pregenerated), "Utilization 1D", "fig6a")

### Figure 6b

In [None]:
graph.draw_bar_graph(graph.load_data("util_2d", pregenerated=pregenerated), "Utilization 2D", "fig6b")

### Figure 7

In [None]:
graph.draw_breakdown(pregenerated=pregenerated)

### Figure 8

In [None]:
graph.draw_bar_graph(graph.load_data("latency", data_cb=lambda a, u: u / a, pregenerated=pregenerated), "Speedup", "fig8")

### Figure 9

In [None]:
graph.draw_bar_graph(graph.load_data("energy", data_cb=lambda a, u: a / u, pregenerated=pregenerated), "Energy Use", "fig9")
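
The `data_cb` callbacks for Figures 8 and 9 (and again for Figures 10 and 11) normalize each value in opposite directions. The sketch below spells this out under an assumption the notebook does not state: that `a` is the value for the configuration being plotted and `u` is the value for the unfused baseline. The function names and the numbers are illustrative only.

```python
def speedup(a, u):
    """Latency normalization: baseline latency divided by configuration latency.

    Assumes `a` is the plotted configuration and `u` the unfused baseline;
    values above 1 then mean the configuration is faster.
    """
    return u / a


def relative_energy(a, u):
    """Energy normalization: configuration energy divided by baseline energy.

    Under the same assumption, values below 1 mean less energy than the baseline.
    """
    return a / u


# Made-up numbers for illustration only, not results from the paper.
print(speedup(50.0, 200.0))          # 4.0
print(relative_energy(30.0, 120.0))  # 0.25
```

Inverting the ratio for latency makes larger bars better in the speedup figures, while the energy figures read as a fraction of the baseline, where smaller is better.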

### Figure 10

In [None]:
graph.draw_bar_graph(graph.load_data("latency", kernel="end2end", data_cb=lambda a, u: u / a, pregenerated=pregenerated), "Speedup", "fig10")

### Figure 11

In [None]:
graph.draw_bar_graph(graph.load_data("energy", kernel="end2end", data_cb=lambda a, u: a / u, pregenerated=pregenerated), "Energy Use", "fig11")

### Figure 12

In [None]:
graph.draw_pareto(pregenerated=pregenerated)