In [3]:
# Execute the following commands to clear output folders
# !rm -rf output
# !rm -rf model

# SODA Toolchain

![tutorial-flow](imgs/flow-diagram-full.png)

# High-Level Application Input (TensorFlow)

### Build a model in TensorFlow (Step 1)

In [4]:
import os  # needed later for os.path.join and os.getcwd
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
import numpy as np
tf.random.set_seed(seed=0)
print(tf.__version__)

in1 = keras.layers.Input(shape=(32,32,1))
tmp = keras.layers.Conv2D(filters=1, kernel_size=(5,5),
                          input_shape=(32,32),
                          padding='same', 
                          strides=(2, 2),
                          activation='relu', 
                          use_bias=True)(in1)
tmp = keras.layers.Flatten()(tmp)
tmp = keras.layers.Dense(units=8, activation='relu')(tmp)
tmp = keras.layers.Dense(units=4, activation='relu')(tmp)
out = keras.layers.Dense(units=2, activation='softmax')(tmp)
model = keras.models.Model(inputs=[in1], outputs=out)

# Compile model with optimizer
model.compile(optimizer="adam",
                loss="sparse_categorical_crossentropy",
                metrics=["accuracy"])

2.9.1
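
As a quick sanity check on the model defined above, the layer output sizes and trainable-parameter counts can be worked out by hand. The sketch below uses only the standard library; the helper names (`conv2d_same_output`, `dense_params`) are hypothetical and not part of Keras. It assumes the usual rule that a `'same'`-padded convolution produces `ceil(size / stride)` outputs per spatial dimension.

```python
import math

def conv2d_same_output(size, stride):
    """Spatial output size of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)

def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Conv2D: 32x32x1 input, one 5x5 filter, stride 2, 'same' padding, with bias
side = conv2d_same_output(32, 2)      # spatial size after the convolution
conv_params = 5 * 5 * 1 * 1 + 1       # kernel weights + one bias

flat = side * side * 1                # Flatten: features fed to the first Dense layer

layer_params = {
    "conv2d": conv_params,
    "dense_8": dense_params(flat, 8),
    "dense_4": dense_params(8, 4),
    "dense_2": dense_params(4, 2),
}
total_params = sum(layer_params.values())

print("conv output: {0}x{0}x1, flattened: {1}".format(side, flat))
print(layer_params, "total:", total_params)
```

The totals can be compared against `model.summary()` in the notebook.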


### Convert model to protobuf

In [5]:
!mkdir -p output

In [6]:
save_path = os.path.join(os.getcwd(), "model/simple/")

# Save model to SavedModel format
tf.saved_model.save(model, save_path)

# Convert Keras model to ConcreteFunction
full_model = tf.function(lambda x: model(x))
full_model = full_model.get_concrete_function(
    x=[
        tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype, name='x1')
    ])

# Get frozen ConcreteFunction
frozen_func = convert_variables_to_constants_v2(full_model)

# Save frozen graph from frozen ConcreteFunction to hard drive
tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                    logdir=os.getcwd(),
                    name="output/frozen_graph.pbtxt",
                    as_text=True)




INFO:tensorflow:Assets written to: /files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/model/simple/assets


2022-05-27 03:06:27.849328: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2022-05-27 03:06:27.849516: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session


'/files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/output/frozen_graph.pbtxt'
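
To get a rough idea of what ended up in the frozen graph, one option is to count the `op:` fields in the text-format protobuf. The following is a minimal sketch under the assumption that each node is written as a `node { name: "..." op: "..." }` block; `count_graph_ops` is a hypothetical helper and the sample text is made up for illustration, not copied from the real `frozen_graph.pbtxt`.

```python
import re
from collections import Counter

def count_graph_ops(pbtxt_text):
    """Count occurrences of each `op: "..."` field in a text-format GraphDef."""
    ops = re.findall(r'^\s*op:\s*"([^"]+)"', pbtxt_text, flags=re.MULTILINE)
    return Counter(ops)

# Hypothetical excerpt, only to exercise the helper
sample = '''node {
  name: "x1"
  op: "Placeholder"
}
node {
  name: "model/conv2d/Conv2D"
  op: "Conv2D"
}
node {
  name: "model/dense/MatMul"
  op: "MatMul"
}
node {
  name: "model/dense_1/MatMul"
  op: "MatMul"
}
'''

ops = count_graph_ops(sample)
for op_name, count in sorted(ops.items()):
    print(op_name, count)
```

On the real file this would be called as `count_graph_ops(open("output/frozen_graph.pbtxt").read())`.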

![tutorial-flow](imgs/flow-diagram.png)

### Transform protobuf into MLIR (Step 2)




In [7]:
!scripts/protobuf-to-tosa.sh output/frozen_graph.pbtxt output/tosa.mlir

2022-05-27 10:06:35.498195: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Lower MLIR to Linalg on Buffers (Step 3)

In [8]:
!scripts/tosa-to-linalg.sh output/tosa.mlir output/linalg-buffers.mlir

![tutorial-flow](imgs/flow-diagram.png)

# SODA-OPT: HW/SW Partitioning and Optimizer (Step 4)

## How to use soda.launch?

### Automatic selection of custom accelerator region

Use a pass named `-convert-<abstraction_name>-<operation_name>-to-soda`.

For example: `-convert-linalg-generic-to-soda`

### Manual selection of custom accelerator region

Add the following lines around the code that will become the accelerator:

```
soda.launch {
  // ...
  // Code to be transformed into an accelerator
  // ...
  soda.terminator
}
```

Run the next cell, then edit the generated file.

In [9]:
!cp output/linalg-buffers.mlir output/01searched-edited.mlir

# Perform manual edit!

Edit the [file](output/01searched-edited.mlir).

Around line 99, modify the code to look like this:

```
soda.launch {
  linalg.batch_matmul ins(%23, %7 : memref<1x4x8xf32>, memref<1x8x4xf32>) 
                      outs(%25 : memref<1x4x4xf32>)
  soda.terminator
}
```

## Optimization pipeline

![optimizations](imgs/optimization-table.png)

### Kernel without SODA-OPT optimizations (Baseline)

In [10]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode=no-aa \
    -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04baseline.mlir \
    2>&1 | cat > output/05intermediate-baseline.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04baseline.mlir \
    -o output/05baseline.ll
)

Visualize [intermediate file](output/05intermediate-baseline.mlir)

### Kernel with SODA-OPT optimizations

In [11]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode \
    -soda-opt-pipeline-for-bambu=use-bare-ptr-memref-call-conv \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04optimized.mlir \
    2>&1 | cat > output/05intermediate-optimized.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04optimized.mlir \
    -o output/05optimized.ll
)


Visualize [intermediate file](output/05intermediate-optimized.mlir)

![tutorial-flow](imgs/flow-diagram.png)

# Bambu: Synthesizing the Outlined Kernel (Step 5)

The following configurations are passed to our backend HLS tool:

* Target: FPGA generation using the Xilinx xc7vx690t-3ffg1930-VVD device
* Memory technology: BRAM
* Number of memory channels: 2
  * Supports 2 parallel reads and 2 parallel writes
* Target frequency: 200 MHz (5 ns period)
* Using Bambu's floating-point operation support
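
The configuration above corresponds to several flags visible in the Bambu command line that `run-bambu.sh` logs in this tutorial. The sketch below assembles those flags from a small set of parameters. The function `bambu_flags` is hypothetical, and the pairing of each bullet with a flag (in particular floating-point support with `--soft-float`) is an assumption read off the logged command, not taken from Bambu's documentation.

```python
def bambu_flags(device, target_mhz, channels, memory_policy, soft_float=True):
    """Build a list of Bambu flags; flag names are copied from the logged command."""
    period_ns = 1000.0 / target_mhz   # clock period in ns from frequency in MHz
    flags = [
        f"--device-name={device}",
        f"--clock-period={period_ns:g}",
        f"--channels-number={channels}",
        f"--memory-allocation-policy={memory_policy}",
    ]
    if soft_float:
        # Assumed to be the flag that selects Bambu's floating-point support
        flags.append("--soft-float")
    return flags

flags = bambu_flags(
    device="xc7vx690t-3ffg1930-VVD",
    target_mhz=200,
    channels=2,
    memory_policy="ALL_BRAM",
)
print(" ".join(flags))
```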

### Baseline kernel

In [12]:
! scripts/run-bambu.sh baseline
# Takes approx. 20 seconds to execute

/files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/output/baseline /files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version
 ==  Bambu executed with: bambu -v3 --print-dot -lm --soft-float --compiler=I386_CLANG10 -O2 --device-name=xc7vx690t-3ffg1930-VVD --clock-period=5 --no-iob --experimental-setup=BAMBU-BALANCED-MP --channels-number=2 --memory-allocation-policy=ALL_BRAM --disable-function-proxy --generate-tb=main_kernel_test.xml --simulate --simulator=VERILATOR --top-fname=main_kernel input.ll 


********************************************************************************
                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

********************************************************************************


In [13]:
baseline_runtime = ""

for runtime in open('output/baseline/bambu-log').readlines():
    if "Average execution" in runtime:
        baseline_runtime = [int(s) for s in runtime.split() if s.isdigit()][0]

print("Average execution in cycles: {}".format(baseline_runtime))

Average execution in cycles: 1542
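
The log-parsing loop is repeated several times in this notebook, so it may be convenient to factor it into a function. The sketch below keeps the same logic (first all-digit token on a line containing "Average execution") but raises an error when no such line is found, instead of leaving an empty string behind. `parse_average_cycles` is a hypothetical helper, and the sample lines are made up rather than copied from a real `bambu-log`.

```python
def parse_average_cycles(log_lines):
    """Return the cycle count from the last 'Average execution' line."""
    cycles = None
    for line in log_lines:
        if "Average execution" in line:
            digits = [int(tok) for tok in line.split() if tok.isdigit()]
            if digits:
                cycles = digits[0]
    if cycles is None:
        raise ValueError("no 'Average execution' line with a cycle count found")
    return cycles

# Hypothetical log content, only to exercise the helper
sample_log = [
    "Simulation started\n",
    "Average execution in cycles: 1542\n",
]
print("Average execution in cycles: {}".format(parse_average_cycles(sample_log)))
```

With a real log this would be used as `with open('output/baseline/bambu-log') as f: baseline_runtime = parse_average_cycles(f)`.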


Visualize [Intermediate Dot File](output/baseline/HLS_output/dot/main_kernel/HLS_STGraph.dot)

### Optimized kernel

In [14]:
! scripts/run-bambu.sh optimized
# Takes approx. 55 seconds to execute

/files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/output/optimized /files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version
 ==  Bambu executed with: bambu -v3 --print-dot -lm --soft-float --compiler=I386_CLANG10 -O2 --device-name=xc7vx690t-3ffg1930-VVD --clock-period=5 --no-iob --experimental-setup=BAMBU-BALANCED-MP --channels-number=2 --memory-allocation-policy=ALL_BRAM --disable-function-proxy --generate-tb=main_kernel_test.xml --simulate --simulator=VERILATOR --top-fname=main_kernel input.ll 


********************************************************************************
                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

********************************************************************************

In [15]:
optimized_runtime = ""

for runtime in open('output/optimized/bambu-log').readlines():
    if "Average execution" in runtime:
        optimized_runtime = [int(s) for s in runtime.split() if s.isdigit()][0]

print("Average execution in cycles: {}".format(optimized_runtime))


Average execution in cycles: 64


Visualize [Intermediate Dot File](output/optimized/HLS_output/dot/main_kernel/HLS_STGraph.dot)

## Comparison of runtime results

* Display runtime
* Display [verilog output file](output/optimized/main_kernel.v)

In [16]:
print("Average execution in cycles of Baseline kernel:  {}".format(baseline_runtime))
print("Average execution in cycles of Optimized kernel: {}".format(optimized_runtime))
print("Speedup: {:.1f}".format(float(baseline_runtime/optimized_runtime)))

Average execution in cycles of Baseline kernel:  1542
Average execution in cycles of Optimized kernel: 64
Speedup: 24.1


# Commandline interface

To list all available parameters for our optimization passes, run:

- `soda-opt -h`

```
      --soda-opt-pipeline-for-bambu                    
        --affine-tile-size=<ulong>                     
        --bitwidth-of-index-type=<uint>                
        --max-alloc-size-in-bytes=<uint>               
        --max-rank-of-allocated-memref=<uint>          
        --number-of-full-unrolls=<uint>                
        --permutation-map=<uint>                       
        --use-bare-ptr-memref-call-conv                
        --no-alloca-promotion                          
        --no-buffer-trick                              
        --no-scalar-replacement                        
  
```

In [17]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt -h 2>&1 | cat > output/helpfile
)

Open [help file](output/helpfile)

### Modifying the number of unrolls

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode \
    -soda-opt-pipeline-for-bambu="use-bare-ptr-memref-call-conv number-of-full-unrolls=1" \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04optimized.mlir \
    2>&1 | cat > output/05intermediate-optimized.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04optimized.mlir \
    -o output/05optimized.ll
)

Visualize [intermediate file](output/05intermediate-optimized.mlir)
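
When trying several option combinations, the pipeline flag can be generated rather than typed by hand. The sketch below follows the quoting used in the cell above (space-separated options inside double quotes, `name=value` for options that take a value). `pipeline_flag` is a hypothetical helper; the option names come from the `soda-opt -h` excerpt in this tutorial.

```python
def pipeline_flag(options):
    """Format options for -soda-opt-pipeline-for-bambu.

    A value of True emits the bare option name; False or None skips it;
    anything else is emitted as name=value.
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(name)
        elif value is not False and value is not None:
            parts.append(f"{name}={value}")
    return '-soda-opt-pipeline-for-bambu="{}"'.format(" ".join(parts))

# Sweep the number of full unrolls while keeping the bare-pointer convention
for unrolls in (1, 2, 4):
    print(pipeline_flag({
        "use-bare-ptr-memref-call-conv": True,
        "number-of-full-unrolls": unrolls,
    }))
```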

In [None]:
! scripts/run-bambu.sh optimized
# Takes approx. 57 seconds to execute

In [None]:
optimized_runtime = ""

for runtime in open('output/optimized/bambu-log').readlines():
    if "Average execution" in runtime:
        optimized_runtime = [int(s) for s in runtime.split() if s.isdigit()][0]

print("Average execution in cycles: {}".format(optimized_runtime))

![tutorial-flow](imgs/flow-diagram.png)

# Vivado Flow: Place and Route generated Verilog (Step 6)

To successfully execute the following steps, Bambu and Vivado 2020.2 are expected to be installed **locally**.

### Baseline kernel

In [18]:
# scripts/run-synthesis.sh <baseline|optimized> <path_to_vivado_installation>
! scripts/run-synthesis.sh baseline /files0/Xilinx/Vivado/2020.2

# Approx. 4min to execute

/files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/output/baseline /files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version
 ==  Bambu executed with: bambu -v3 --print-dot -lm --soft-float --compiler=I386_CLANG10 -O2 --device-name=xc7vx690t-3ffg1930-VVD --clock-period=5 --no-iob --experimental-setup=BAMBU-BALANCED-MP --channels-number=2 --memory-allocation-policy=ALL_BRAM --disable-function-proxy --generate-tb=main_kernel_test.xml --simulate --simulator=VERILATOR --evaluation --top-fname=main_kernel --xilinx-root=/files0/Xilinx/Vivado/2020.2 input.ll 


********************************************************************************
                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

*************************

### Optimized kernel

In [19]:
! scripts/run-synthesis.sh optimized /files0/Xilinx/Vivado/2020.2

# Approx. 11min to execute

/files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version/output/optimized /files0/extended/bohm747/Development/soda/soda-opt/docs/tutorials/isc2022/docker-version
 ==  Bambu executed with: bambu -v3 --print-dot -lm --soft-float --compiler=I386_CLANG10 -O2 --device-name=xc7vx690t-3ffg1930-VVD --clock-period=5 --no-iob --experimental-setup=BAMBU-BALANCED-MP --channels-number=2 --memory-allocation-policy=ALL_BRAM --disable-function-proxy --generate-tb=main_kernel_test.xml --simulate --simulator=VERILATOR --evaluation --top-fname=main_kernel --xilinx-root=/files0/Xilinx/Vivado/2020.2 input.ll 


********************************************************************************
                    ____                  _
                   | __ )  __ _ _ __ ___ | |_   _   _
                   |  _ \ / _` | '_ ` _ \| '_ \| | | |
                   | |_) | (_| | | | | | | |_) | |_| |
                   |____/ \__,_|_| |_| |_|_.__/ \__,_|

************************

## Comparison of synthesis results

* Cycles
* LUTs
* DSPs

In [20]:
import pandas as pd
experiment_id=1
rows = ['CYCLES', 'AREAxTIME', 'AREA',
        'SLICE', 'SLICE_LUTS', 'REGISTERS',
        'DSPS', 'BRAMS', 'PERIOD',
        'CLOCK_SLACK', 'FREQUENCY', 'TIME',
        'TOTAL_TIME', 'TOTAL_CYCLES', 'HLS_execution_time']
df_baseline = pd.read_xml('output/baseline/bambu_results_{}.xml'.format(experiment_id), names=['baseline'])
df_optimized = pd.read_xml('output/optimized/bambu_results_{}.xml'.format(experiment_id), names=['optimized'])
df = pd.concat([df_baseline, df_optimized], axis=1)
df.index=rows
df['optimized/baseline']=df['optimized']/df['baseline']
df


| Metric | baseline | optimized | optimized/baseline |
|---|---|---|---|
| CYCLES | 1542.0 | 64.0 | 0.041505 |
| AREAxTIME | 17519.778408 | 5412.904448 | 0.30896 |
| AREA | 2398.0 | 17038.0 | 7.105088 |
| SLICE | 1010.0 | 7145.0 | 7.074257 |
| SLICE_LUTS | 2398.0 | 17038.0 | 7.105088 |
| REGISTERS | 2265.0 | 19074.0 | 8.421192 |
| DSPS | 2.0 | 32.0 | 16.0 |
| BRAMS | 0.0 | 0.0 | |
| PERIOD | 4.738 | 4.964 | 1.047699 |
| CLOCK_SLACK | 0.262 | 0.036 | 0.137405 |
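
The AREAxTIME values reported above appear to be consistent with AREA × CYCLES × PERIOD / 1000, that is, area multiplied by execution time in microseconds when PERIOD is in nanoseconds. This relation is inferred from the numbers in the table, not taken from Bambu's documentation, so treat the sketch below as a consistency check only. The helper name `area_time` is hypothetical.

```python
def area_time(area, cycles, period_ns):
    """Area multiplied by execution time, with time expressed in microseconds."""
    return area * cycles * period_ns / 1000.0

# (AREA, CYCLES, PERIOD, reported AREAxTIME) copied from the results table
reported = {
    "baseline": (2398.0, 1542.0, 4.738, 17519.778408),
    "optimized": (17038.0, 64.0, 4.964, 5412.904448),
}

for name, (area, cycles, period_ns, expected) in reported.items():
    computed = area_time(area, cycles, period_ns)
    print("{:9s} computed={:.6f} reported={:.6f}".format(name, computed, expected))
```

Under this reading, the optimized kernel uses about 7x the area but still comes out ahead on the combined metric because its execution time drops by a larger factor.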


### Post place and route comparison

A matrix-multiply kernel performs approximately 2 × M × N × K arithmetic operations: one multiplication and one addition for each of the M × N × K inner-loop iterations.

Our selected kernel has the following sizes:

```
linalg.batch_matmul ins(%23, %6 : memref<1x4x8xf32>, memref<1x8x4xf32>) 
                    outs(%25 : memref<1x4x4xf32>)
```
M=4, K=8, N=4

This gives approximately 2 × 4 × 4 × 8 = **256** floating-point arithmetic operations.
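
The same count can be derived directly from the `memref` types of the operands. The sketch below parses statically shaped types such as `memref<1x4x8xf32>` and applies the 2 × M × N × K estimate per batch. Both helpers are hypothetical, and the parser deliberately handles only fully static shapes with a simple element type.

```python
import re

def memref_shape(memref_type):
    """Extract the static shape from a type like 'memref<1x4x8xf32>'."""
    match = re.fullmatch(r"memref<((?:\d+x)+)\w+>", memref_type.strip())
    if match is None:
        raise ValueError(f"unsupported memref type: {memref_type!r}")
    return tuple(int(dim) for dim in match.group(1).rstrip("x").split("x"))

def batch_matmul_flops(lhs_type, rhs_type):
    """Estimate 2*B*M*N*K operations for a (B,M,K) x (B,K,N) batch matmul."""
    batch, m, k = memref_shape(lhs_type)
    batch_rhs, k_rhs, n = memref_shape(rhs_type)
    if (batch, k) != (batch_rhs, k_rhs):
        raise ValueError("operand shapes are not compatible")
    return 2 * batch * m * n * k

flop_estimate = batch_matmul_flops("memref<1x4x8xf32>", "memref<1x8x4xf32>")
print("Estimated floating-point operations:", flop_estimate)
```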

In [21]:
mega_multiplier=1e6
flop_count = 256 # floating-point arithmetic operations (2*M*N*K)
target_frequency = 200e+6 # 200MHz

baseline_utilization_area=df.loc['AREA','baseline']
optimized_utilization_area=df.loc['AREA','optimized']


baseline_runtime_in_s = baseline_runtime/target_frequency 
optimized_runtime_in_s = optimized_runtime/target_frequency

baseline_flops= flop_count/baseline_runtime_in_s
optimized_flops= flop_count/optimized_runtime_in_s


print("Execution in cycles of Baseline kernel:  {}".format(baseline_runtime))
print("Execution in cycles of Optimized kernel:   {}".format(optimized_runtime))

print("Speedup: \t\t\t{:.2f}x".format(baseline_runtime/optimized_runtime))
print("Area utilization overhead: \t {:.2f}x".format(optimized_utilization_area/baseline_utilization_area))

print("Baseline  \t\t\t {:.2f} MFLOPS ".format(baseline_flops/mega_multiplier))
print("Optimized \t\t\t{:.2f} MFLOPS".format(optimized_flops/mega_multiplier))


Execution in cycles of Baseline kernel:  1542
Execution in cycles of Optimized kernel:   64
Speedup: 			24.09x
Area utilization overhead: 	 7.11x
Baseline  			 33.20 MFLOPS 
Optimized 			800.00 MFLOPS


## Generated Design files

Vivado checkpoints can be found here:

* output/baseline/HLS_output/Synthesis/vivado_flow_0/post_place.dcp
* output/optimized/HLS_output/Synthesis/vivado_flow_0/post_place.dcp
* `output/<baseline|optimized>/HLS_output/Synthesis/vivado_flow_X/`

### Baseline and Optimized Side by Side

![baseline_view](imgs/baseline_view.png)
![optimized_view](imgs/optimized_view.png)

# Thank you!