# SODA Toolchain

![tutorial-flow](imgs/flow-diagram.png)

# High-Level Application Input (TensorFlow)

### Build a model in TensorFlow (Step 1)

In [39]:
import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
import numpy as np
import voxelmorph as vxm

tf.random.set_seed(seed=0)
print(tf.__version__)

model_path = "/home/users/giuseppe.sorrentino/SODA/NonRigidReg/models/abdomreg_intra.h5"
model = vxm.networks.VxmDense.load(
    model_path
)
for input_tensor in model.inputs:
    print(input_tensor.shape)

print(model.inputs)
print(model.outputs)



2.13.1




(None, 32, 288, 288, 1)
(None, 32, 288, 288, 1)
[<KerasTensor: shape=(None, 32, 288, 288, 1) dtype=float32 (created by layer 'vxm_dense_source_input')>, <KerasTensor: shape=(None, 32, 288, 288, 1) dtype=float32 (created by layer 'vxm_dense_target_input')>]
[<KerasTensor: shape=(None, 32, 288, 288, 1) dtype=float32 (created by layer 'vxm_dense_transformer')>, <KerasTensor: shape=(None, 16, 144, 144, 3) dtype=float32 (created by layer 'vxm_dense_flow_resize')>]


### Convert model to protobuf

In [None]:
!mkdir -p output

In [47]:
# Save the model in SavedModel format (optional)
save_path = os.path.join(os.getcwd(), "model/simple/")
tf.saved_model.save(model, save_path)

@tf.function
def infer(moving, fixed):
    return model([moving, fixed])

inp0, inp1 = model.inputs
concrete_func = infer.get_concrete_function(
    moving=tf.TensorSpec(shape=inp0.shape, dtype=inp0.dtype, name=inp0.name.split(':')[0]),
    fixed =tf.TensorSpec(shape=inp1.shape, dtype=inp1.dtype, name=inp1.name.split(':')[0])
)

# Freeze the variables into constants and save the graph
frozen_func = convert_variables_to_constants_v2(concrete_func)
tf.io.write_graph(
    graph_or_graph_def=frozen_func.graph,
    logdir=os.getcwd(),
    name="output/frozen_graph.pbtxt",
    as_text=True
)

#for op in frozen_func.graph.get_operations():
#    print(op.name)

INFO:tensorflow:Assets written to: /home/users/giuseppe.sorrentino/SODA/NonRigidReg/soda-opt/docs/tutorials/tensorflow/docker-version/model/simple/assets


2025-04-28 22:23:09.425958: I tensorflow/core/grappler/devices.cc:75] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA or ROCm support)
2025-04-28 22:23:09.426266: I tensorflow/core/grappler/clusters/single_machine.cc:357] Starting new session


'/home/users/giuseppe.sorrentino/SODA/NonRigidReg/soda-opt/docs/tutorials/tensorflow/docker-version/output/frozen_graph.pbtxt'
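Before converting the frozen graph, it can help to confirm which placeholder (input) nodes it contains. The following is a minimal sketch that scans a text-format GraphDef with a regular expression. It assumes each `node { ... }` entry lists `name` immediately before `op`, which is how TensorFlow normally serializes nodes. The helper name `placeholder_names` and the sample text are illustrative and are not part of the tutorial scripts.

```python
import re

# Matches the start of a node entry whose op is "Placeholder".
# Assumption: `name` is serialized immediately before `op`.
_PLACEHOLDER_RE = re.compile(
    r'node\s*\{\s*name:\s*"([^"]+)"\s*op:\s*"Placeholder"'
)

def placeholder_names(pbtxt_text):
    """Return the names of all Placeholder nodes in a text-format GraphDef."""
    return _PLACEHOLDER_RE.findall(pbtxt_text)

# Tiny hand-written sample, not taken from the real frozen graph.
sample = '''
node {
  name: "moving"
  op: "Placeholder"
}
node {
  name: "conv/kernel"
  op: "Const"
}
node {
  name: "fixed"
  op: "Placeholder"
}
'''
print(placeholder_names(sample))
```

On the real file the same helper could be called with the contents of `output/frozen_graph.pbtxt`. Note that a frozen graph stores its weights as text, so the file may be large.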

![tutorial-flow](imgs/flow-diagram.png)

### Transform protobuf into MLIR (Step 2)




In [48]:
for input_tensor in model.inputs:
    print(f"Input name: {input_tensor.name} - shape: {input_tensor.shape} - dtype: {input_tensor.dtype}")

for output_tensor in model.outputs:
    print(f"Output name: {output_tensor.name} - shape: {output_tensor.shape} - dtype: {output_tensor.dtype}")

Input name: vxm_dense_source_input
Input name: vxm_dense_target_input
Output name: vxm_dense_transformer/map/TensorArrayV2Stack/TensorListStack:0
Output name: vxm_dense_flow_resize/map/TensorArrayV2Stack/TensorListStack:0


In [52]:
!scripts/protobuf-to-tosa.sh output/frozen_graph.pbtxt output/tosa.mlir

2025-04-28 20:26:09.984063: E tensorflow/compiler/xla/status_macros.cc:57] INTERNAL: RET_CHECK failure (tensorflow/compiler/mlir/tensorflow/translate/mlir_roundtrip_flags.cc:109) absl::SimpleAtoi(dim_str, &size) 
*** Begin stack trace ***
	tsl::CurrentStackTrace[abi:cxx11]()
	xla::status_macros::MakeErrorStream::Impl::GetStatus()
	__libc_start_main
*** End stack trace ***
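
The `RET_CHECK` failure above is raised while parsing a dimension string as an integer (`absl::SimpleAtoi(dim_str, &size)`). This suggests that one of the input shapes handed to the converter contains a non-numeric entry, such as the dynamic batch dimension `None`. That is an interpretation of the log, not a confirmed diagnosis. A minimal sketch of building a fully static shape string from the Keras shapes follows. The helper `static_shape_string`, the replacement value 1, and the comma/colon-separated format are assumptions; check `scripts/protobuf-to-tosa.sh` to see how it actually passes shapes.

```python
def static_shape_string(shapes, unknown_dim=1):
    """Join shapes as 'd0,d1,...:d0,d1,...'.

    Every unknown dimension (None) is replaced with `unknown_dim`.
    The separator convention is an assumption, not taken from the tutorial.
    """
    parts = []
    for shape in shapes:
        dims = [unknown_dim if d is None else int(d) for d in shape]
        parts.append(",".join(str(d) for d in dims))
    return ":".join(parts)

# Input shapes printed by the model-loading cell; only the batch
# dimension is unknown.
input_shapes = [(None, 32, 288, 288, 1), (None, 32, 288, 288, 1)]
print(static_shape_string(input_shapes))
```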



### Lower MLIR to Linalg on Buffers (Step 3)

In [None]:
!scripts/tosa-to-linalg.sh output/tosa.mlir output/linalg-buffers.mlir

![tutorial-flow](imgs/flow-diagram.png)

# SODA-OPT: HW/SW Partitioning and Optimizer (Step 4)

## How to use soda.launch?

### Automatic selection of custom accelerator region

Use a pass of the form `-convert-<abstraction_name>-<operation_name>-to-soda`.

For example: `-convert-linalg-generic-to-soda`
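
The naming pattern can be sketched as a small formatting helper. This only illustrates how the flag name is assembled. The function `soda_conversion_flag` is hypothetical, and `-convert-linalg-generic-to-soda` is the only combination this tutorial confirms; other combinations may not exist, so check `soda-opt -h` before relying on them.

```python
def soda_conversion_flag(abstraction, operation):
    """Build a flag following -convert-<abstraction>-<operation>-to-soda."""
    return "-convert-{}-{}-to-soda".format(abstraction, operation)

# The one combination shown in this tutorial.
print(soda_conversion_flag("linalg", "generic"))
```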

### Manual selection of custom accelerator region

Wrap any code that should become the accelerator in a `soda.launch` region:

```mlir
soda.launch {
  // ...
  // Code to be transformed into an accelerator
  // ...
  soda.terminator
}
```

Run the next cell, then edit the [copied file](output/01searched-edited.mlir).

In [None]:
!cp output/linalg-buffers.mlir output/01searched-edited.mlir

# Perform manual edit!

> **⚠️ <span style="color:red;">IMPORTANT:</span> Please modify the file as described below.**

Edit the [file](output/01searched-edited.mlir).

Replace line 101 with the following lines:

```mlir
    soda.launch {
      linalg.batch_matmul ins(%expand_shape_14, %4 : memref<1x4x8xf32>, memref<1x8x4xf32>) outs(%alloc_16 : memref<1x4x4xf32>)
      soda.terminator
    }
```
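
The manual edit could also be scripted. The sketch below wraps one line of an MLIR file in a `soda.launch` region, preserving its indentation. The helper `wrap_in_soda_launch` is hypothetical, and the sample text is a simplified stand-in rather than the real `output/01searched-edited.mlir`. The line number to wrap (101 in this tutorial) depends on the generated file, so verify it before applying.

```python
def wrap_in_soda_launch(mlir_text, line_number):
    """Wrap the 1-indexed line `line_number` in soda.launch { ... }."""
    lines = mlir_text.splitlines()
    target = lines[line_number - 1]
    # Reuse the leading whitespace of the wrapped line.
    indent = target[: len(target) - len(target.lstrip())]
    wrapped = [
        indent + "soda.launch {",
        indent + "  " + target.lstrip(),
        indent + "  soda.terminator",
        indent + "}",
    ]
    lines[line_number - 1 : line_number] = wrapped
    return "\n".join(lines) + "\n"

# Simplified stand-in for the real file (SSA values are not defined here).
sample = (
    "func.func @main() {\n"
    "    linalg.batch_matmul ins(%a, %b : memref<1x4x8xf32>, memref<1x8x4xf32>)"
    " outs(%c : memref<1x4x4xf32>)\n"
    "    return\n"
    "}\n"
)
print(wrap_in_soda_launch(sample, 2))
```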

## Optimization pipeline

![optimizations](imgs/optimization-table.png)

### Kernel without SODA-OPT optimizations (Baseline)

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode=no-aa \
    -lower-all-to-llvm=use-bare-ptr-memref-call-conv \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04baseline.mlir \
    2>&1 | cat > output/05intermediate-baseline.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04baseline.mlir \
    -o output/05baseline.ll
)

Visualize [intermediate file](output/05intermediate-baseline.mlir)

### Kernel with SODA-OPT optimizations

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode \
    -soda-opt-pipeline-for-bambu=use-bare-ptr-memref-call-conv \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04optimized.mlir \
    2>&1 | cat > output/05intermediate-optimized.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04optimized.mlir \
    -o output/05optimized.ll
)


Visualize [intermediate file](output/05intermediate-optimized.mlir)

![tutorial-flow](imgs/flow-diagram.png)

# Bambu: Synthesizing the Outlined Kernel (Step 5)

The following configurations are passed to our backend HLS tool:

* Target: ASIC generation using the Nangate cell library with the FreePDK 45nm kit
* Memory technology: SRAM
* Number of memory channels: 2
  * Supports 2 parallel reads and 2 parallel writes
* Target frequency: 200MHz (5ns period)
* Using bambu's floating-point operation support

You can change the parameters passed to Bambu in [scripts/run-bambu.sh](scripts/run-bambu.sh).
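
The relation between the target frequency and the clock period listed above can be checked with a short sketch. It also shows how a cycle count converts to wall-clock time at that frequency. The helper names are illustrative only.

```python
TARGET_FREQUENCY_HZ = 200e6  # 200 MHz, as listed in the configuration above

def clock_period_ns(frequency_hz):
    """Clock period in nanoseconds for a given frequency in Hz."""
    return 1e9 / frequency_hz

def cycles_to_seconds(cycles, frequency_hz=TARGET_FREQUENCY_HZ):
    """Execution time in seconds for a cycle count at the given frequency."""
    return cycles / frequency_hz

print(clock_period_ns(TARGET_FREQUENCY_HZ))  # 5.0
```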

### Baseline kernel

In [None]:
! scripts/run-bambu.sh baseline

In [None]:
baseline_runtime = None

with open('output/baseline/bambu-log') as log:
    for line in log:
        if "Average execution" in line:
            # Keep the first integer token on the line (the cycle count)
            baseline_runtime = [int(s) for s in line.split() if s.isdigit()][0]

print("Average execution in cycles: {}".format(baseline_runtime))
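
The same extraction can be factored into a reusable function so the baseline and optimized logs share one code path. This is a sketch under the assumption that the log reports the cycle count as a standalone integer on a line containing "Average execution". The function name and the sample log line are hypothetical.

```python
def parse_average_cycles(log_text):
    """Return the cycle count from the last 'Average execution' line, or None."""
    cycles = None
    for line in log_text.splitlines():
        if "Average execution" in line:
            numbers = [int(tok) for tok in line.split() if tok.isdigit()]
            if numbers:
                cycles = numbers[0]
    return cycles

# Hypothetical log excerpt, used only to exercise the parser.
sample_log = "Start\nAverage execution in cycles: 1234\nDone\n"
print(parse_average_cycles(sample_log))
```

Returning `None` when no matching line exists makes a failed synthesis run easier to spot than an empty string would.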

Visualize [Intermediate Dot File](output/baseline/HLS_output/dot/main_kernel/HLS_STGraph.dot)

### Optimized kernel

In [None]:
! scripts/run-bambu.sh optimized
# Takes approx. 30 seconds to execute

In [None]:
optimized_runtime = None

with open('output/optimized/bambu-log') as log:
    for line in log:
        if "Average execution" in line:
            # Keep the first integer token on the line (the cycle count)
            optimized_runtime = [int(s) for s in line.split() if s.isdigit()][0]

print("Average execution in cycles: {}".format(optimized_runtime))


Visualize [Intermediate Dot File](output/optimized/HLS_output/dot/main_kernel/HLS_STGraph.dot)

## Comparison of runtime results

* Display runtime
* Display [Verilog output file](output/optimized/main_kernel.v)

In [None]:
print("Average execution in cycles of Baseline kernel:  {}".format(baseline_runtime))
print("Average execution in cycles of Optimized kernel: {}".format(optimized_runtime))
print("Speedup: {:.1f}".format(float(baseline_runtime/optimized_runtime)))

# Commandline interface

To list all available parameters for our optimization passes, run:

- `soda-opt -h`

```
      --soda-opt-pipeline-for-bambu                    
        --affine-tile-size=<ulong>                     
        --bitwidth-of-index-type=<uint>                
        --max-alloc-size-in-bytes=<uint>               
        --max-rank-of-allocated-memref=<uint>          
        --number-of-full-unrolls=<uint>                
        --permutation-map=<uint>                       
        --use-bare-ptr-memref-call-conv                
        --no-alloca-promotion                          
        --no-buffer-trick                              
        --no-scalar-replacement                        
  
```

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt -h 2>&1 | cat > output/helpfile
)

Open [help file](output/helpfile)

### Modifying the number of unrolls

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode \
    -soda-opt-pipeline-for-bambu="use-bare-ptr-memref-call-conv number-of-full-unrolls=1" \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04optimized.mlir \
    2>&1 | cat > output/05intermediate-optimized.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04optimized.mlir \
    -o output/05optimized.ll
)

Visualize [intermediate file](output/05intermediate-optimized.mlir)
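
When sweeping several pipeline settings, the quoted option string can be assembled programmatically. The sketch below reproduces the flag used in the cell above from a dictionary. The helper `bambu_pipeline_flag` is hypothetical; the option names come from the `soda-opt -h` listing shown earlier.

```python
def bambu_pipeline_flag(options):
    """Format options for -soda-opt-pipeline-for-bambu.

    `options` maps option names to values; a value of True emits a bare
    flag, any other value is emitted as name=value.
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(name)
        else:
            parts.append("{}={}".format(name, value))
    return '-soda-opt-pipeline-for-bambu="{}"'.format(" ".join(parts))

print(bambu_pipeline_flag({
    "use-bare-ptr-memref-call-conv": True,
    "number-of-full-unrolls": 1,
}))
```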

In [None]:
! scripts/run-bambu.sh optimized

### Default optimization pipeline (again)

Three full unrolls of the inner loop yield better latency for this kernel.

In [None]:
%%bash
(
  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  soda-opt \
    -soda-outline-bambu-code \
    -soda-extract-arguments-to-xml=using-bare-ptr \
    -soda-generate-bambu-accelcode \
    -soda-opt-pipeline-for-bambu=use-bare-ptr-memref-call-conv \
    -mlir-print-ir-after-all \
    output/01searched-edited.mlir \
    -o output/04optimized.mlir \
    2>&1 | cat > output/05intermediate-optimized.mlir

  docker run -u $(id -u) -v $(pwd):/working_dir --rm agostini01/soda \
  mlir-translate -opaque-pointers=0  \
    --mlir-to-llvmir \
    output/04optimized.mlir \
    -o output/05optimized.ll
)

In [None]:
! scripts/run-bambu.sh optimized

![tutorial-flow](imgs/flow-diagram.png)

# OpenROAD Flow: Automatic ASIC Place and Route (Step 6)

### Baseline kernel

In [None]:
! scripts/run-openroad.sh baseline

# Approx. 4min to execute

### Optimized kernel

In [None]:
! scripts/run-openroad.sh optimized

# Approx. 23min to execute

## Comparison of synthesis results

* Display area
* Display power
* Calculate and display FLOPS/W

In [None]:
log_path_suffix='HLS_output/Synthesis/bash_flow/openroad/logs/nangate45/main_kernel/base/6_report.log'
gds_path_suffix='HLS_output/Synthesis/bash_flow/openroad/results/nangate45/main_kernel/base/6_final.gds'

### Baseline

In [None]:
baseline_log='output/baseline/'+log_path_suffix

In [None]:
power_multiplier = 1  # OpenROAD reports power in W

log_file = baseline_log
total_power = None

with open(log_file, 'r') as log:
  for l in log:
    # Total power row: contains "Total" but not "Group"
    if ("Total" in l and "Group" not in l):
      total_power = float(l.split()[4]) * power_multiplier

    if ("Design area" in l):
      available_area = float(l.split()[2])
      utilization_area = float(l.split()[4].strip('%'))
  

print('Baseline accelerator:')
print('  total power consumption: {}W'.format(total_power))
print('  available chip area: {} um^2'.format(available_area))
print('  utilized chip area: {}%'.format(utilization_area))


baseline_total_power=total_power
baseline_available_area=available_area
baseline_utilization_area=utilization_area


### Optimized for runtime

In [None]:
optimized_log='output/optimized/'+log_path_suffix

In [None]:
log_file = optimized_log
total_power = None

with open(log_file, 'r') as log:
  for l in log:
    # Total power row: contains "Total" but not "Group"
    if ("Total" in l and "Group" not in l):
      total_power = float(l.split()[4]) * power_multiplier

    if ("Design area" in l):
      available_area = float(l.split()[2])
      utilization_area = float(l.split()[4].strip('%'))
  

print('Optimized accelerator:')
print('  total power consumption: {}W'.format(total_power))
print('  available chip area: {} um^2'.format(available_area))
print('  utilized chip area: {}%'.format(utilization_area))

optimized_total_power=total_power
optimized_available_area=available_area
optimized_utilization_area=utilization_area
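
The baseline and optimized cells repeat the same parsing logic, which could be shared through one function. The sketch below uses the same field positions as the cells above. The function name `parse_openroad_report` and the sample report text are hypothetical; the sample is shaped only to match those field positions and is not real OpenROAD output.

```python
def parse_openroad_report(log_text):
    """Extract total power (W), area (um^2) and utilization (%) from a report."""
    result = {"total_power": None, "area": None, "utilization": None}
    for line in log_text.splitlines():
        fields = line.split()
        # Total power row: contains "Total" but not "Group".
        if "Total" in line and "Group" not in line:
            result["total_power"] = float(fields[4])
        if "Design area" in line:
            result["area"] = float(fields[2])
            result["utilization"] = float(fields[4].strip("%"))
    return result

# Hypothetical excerpt, used only to exercise the parser.
sample = (
    "Group   Internal  Switching  Leakage  Total\n"
    "Total   1.0e-03   2.0e-03    5.0e-05  3.05e-03 100.0%\n"
    "Design area 12000 u^2 40% utilization.\n"
)
print(parse_openroad_report(sample))
```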

## Post place and route comparison

A matrix-multiply kernel performs approximately 2×M×N×K arithmetic operations.

Our selected kernel has the following sizes:

```mlir
linalg.batch_matmul ins(%A, %B : memref<1x4x8xf32>, memref<1x8x4xf32>) 
                    outs(%C : memref<1x4x4xf32>)

```
M=4, K=8, N=4

This gives approximately 2×4×4×8 = **256** floating-point arithmetic operations.
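
The operation count can be reproduced with a short helper, which is convenient if a kernel of a different shape is selected. The function name is illustrative. It counts one multiply and one add per (m, n, k) triple and ignores any other overhead.

```python
def batch_matmul_flops(batch, m, k, n):
    """Approximate FLOPs of a batched matmul: 2 * batch * M * N * K."""
    return 2 * batch * m * n * k

# Shapes from the selected kernel: 1x4x8 times 1x8x4 gives 1x4x4.
print(batch_matmul_flops(batch=1, m=4, k=8, n=4))  # 256
```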

In [None]:
giga_multiplier = 1e9
flop_count = 256  # floating-point arithmetic operations
target_frequency = 200e+6  # 200 MHz

optimized_runtime_in_s = optimized_runtime/target_frequency
baseline_runtime_in_s = baseline_runtime/target_frequency 

baseline_flops_per_watt= flop_count/baseline_runtime_in_s/baseline_total_power
optimized_flops_per_watt= flop_count/optimized_runtime_in_s/optimized_total_power


print("Execution in cycles of Baseline kernel:  {}".format(baseline_runtime))
print("Execution in cycles of Optimized kernel:   {}".format(optimized_runtime))

print("Speedup: \t\t\t{:.2f}x".format(baseline_runtime/optimized_runtime))
print("Area utilization overhead: \t {:.2f}x".format(optimized_utilization_area/baseline_utilization_area))
print("Area overhead: \t\t\t {:.2f}x".format(optimized_available_area/baseline_available_area))
print("Power overhead: \t\t {:.2f}x".format(optimized_total_power/baseline_total_power))

print("Baseline  \t\t\t {:.2f} GFLOPS/W ".format(baseline_flops_per_watt/giga_multiplier))
print("Optimized \t\t\t{:.2f} GFLOPS/W".format(optimized_flops_per_watt/giga_multiplier))


## Generated GDSII files

Output files can be found here:

* output/baseline/HLS_output/Synthesis/bash_flow/openroad/results/nangate45/main_kernel/base/6_final.gds
* output/optimized/HLS_output/Synthesis/bash_flow/openroad/results/nangate45/main_kernel/base/6_final.gds

### Baseline and Optimized Side by Side

![Side-By-Side](imgs/gds-side-by-side.png)

# Thank you!