# Optimiser Transform

In this section, we start looking at [fpgaconvnet-optimiser](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser), which automatically performs Design Space Exploration (DSE) based on the predictions from [fpgaconvnet-model](https://github.com/AlexMontgomerie/fpgaconvnet-model) and identifies the optimal accelerator configuration.

The optimiser repository introduces `transform`s that manipulate the accelerator configuration, trading resources for better performance:

[`partition`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/partition.py), which can split/merge the partitions in a network.

[`coarse`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/coarse.py), which controls the number of parallel data streams into and out of a given layer.

[`fine`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/fine.py), which changes the parallelism on the kernel dimension of convolutional layers.

[`weights_reloading`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/weights_reloading.py), which repeats the computation of a partition while multiplexing different weights. It only affects the last convolutional or fully-connected layer in a partition.

Next, let's see some examples of using these `transform`s.

In [8]:
from fpgaconvnet.parser.Parser import Parser

onnx_path = "../3.1_model_onnx_parser/fp16/vgg16_bn.onnx"
parser = Parser(custom_onnx=True, batch_size=1)
net = parser.onnx_to_fpgaconvnet(onnx_path)

for i, partition in enumerate(net.partitions):
    print(f"{i}, resource: ", partition.get_resource_usage())

0, resource:  {'FF': 38956, 'LUT': 63555, 'DSP': 17, 'BRAM': 15002, 'URAM': 0}


The [`split_complete`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/partition.py#L242) function splits the network into partitions that each contain a single layer. On the device, partitions are scheduled sequentially in a time-multiplexed manner, so only one partition occupies the device at a time and the actual resource requirement is significantly reduced.

In [9]:
import fpgaconvnet.optimiser.transforms as transforms

transforms.partition.split_complete(net, None)
for i, partition in enumerate(net.partitions):
    transforms.weights_reloading.remove_weights_reloading_transform(partition)
    partition.update()
    print(f"{i}, resource: ", partition.get_resource_usage())

0, resource:  {'FF': 2108, 'LUT': 4206, 'DSP': 1, 'BRAM': 4, 'URAM': 0}
1, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
2, resource:  {'FF': 2195, 'LUT': 4197, 'DSP': 1, 'BRAM': 45, 'URAM': 0}
3, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
4, resource:  {'FF': 504, 'LUT': 355, 'DSP': 0, 'BRAM': 2, 'URAM': 0}
5, resource:  {'FF': 2213, 'LUT': 4287, 'DSP': 1, 'BRAM': 79, 'URAM': 0}
6, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
7, resource:  {'FF': 2239, 'LUT': 4287, 'DSP': 1, 'BRAM': 153, 'URAM': 0}
8, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
9, resource:  {'FF': 524, 'LUT': 349, 'DSP': 0, 'BRAM': 4, 'URAM': 0}
10, resource:  {'FF': 2276, 'LUT': 4465, 'DSP': 1, 'BRAM': 295, 'URAM': 0}
11, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
12, resource:  {'FF': 2305, 'LUT': 4465, 'DSP': 1, 'BRAM': 585, 'URAM': 0}
13, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
14, reso
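
The effect of time-multiplexing on the device requirement can be sketched with a toy model. The helper names below are hypothetical and not part of fpgaconvnet; the sample figures are the first three partitions printed above. Summing the figures stands in for keeping every partition on the device at once, which is a simplifying assumption (the model's own estimate for a merged partition would differ), while the per-resource maximum stands in for loading one partition at a time.

```python
# Toy model of time-multiplexed partitions (hypothetical helpers, not the fpgaconvnet API).
# Sample per-partition resources, copied from the first three partitions above.
partitions = [
    {"FF": 2108, "LUT": 4206, "DSP": 1, "BRAM": 4, "URAM": 0},
    {"FF": 35, "LUT": 16, "DSP": 0, "BRAM": 0, "URAM": 0},
    {"FF": 2195, "LUT": 4197, "DSP": 1, "BRAM": 45, "URAM": 0},
]


def resident_requirement(parts):
    """Assumption: with all partitions on the device at once, usage adds up."""
    return {key: sum(p[key] for p in parts) for key in parts[0]}


def time_multiplexed_requirement(parts):
    """With one partition loaded at a time, the largest one sets the bound."""
    return {key: max(p[key] for p in parts) for key in parts[0]}


print("all resident:    ", resident_requirement(partitions))
print("time-multiplexed:", time_multiplexed_requirement(partitions))
```

Under this model the device only has to fit the most demanding partition for each resource type, which is why a single large layer can still dominate the requirement.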

You may notice that the BRAM requirement of some partitions is still quite high, because a single layer cannot be split across multiple partitions. Instead, the [`apply_max_weights_reloading`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/8a2487a2ecf6b59af3352af8ab78a44a1f443f05/fpgaconvnet/optimiser/transforms/weights_reloading.py#L34) function can further reduce BRAM usage by storing only the weights of a single filter at a time.

In [10]:
for i, partition in enumerate(net.partitions):
    transforms.weights_reloading.apply_max_weights_reloading(partition)
    partition.update()
    print(f"{i}, resource: ", partition.get_resource_usage())

0, resource:  {'FF': 2090, 'LUT': 4120, 'DSP': 1, 'BRAM': 1, 'URAM': 0}
1, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
2, resource:  {'FF': 2177, 'LUT': 4111, 'DSP': 1, 'BRAM': 9, 'URAM': 0}
3, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
4, resource:  {'FF': 504, 'LUT': 355, 'DSP': 0, 'BRAM': 2, 'URAM': 0}
5, resource:  {'FF': 2177, 'LUT': 4111, 'DSP': 1, 'BRAM': 7, 'URAM': 0}
6, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
7, resource:  {'FF': 2202, 'LUT': 4111, 'DSP': 1, 'BRAM': 10, 'URAM': 0}
8, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
9, resource:  {'FF': 524, 'LUT': 349, 'DSP': 0, 'BRAM': 4, 'URAM': 0}
10, resource:  {'FF': 2202, 'LUT': 4111, 'DSP': 1, 'BRAM': 8, 'URAM': 0}
11, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
12, resource:  {'FF': 2231, 'LUT': 4111, 'DSP': 1, 'BRAM': 11, 'URAM': 0}
13, resource:  {'FF': 35, 'LUT': 16, 'DSP': 0, 'BRAM': 0, 'URAM': 0}
14, resource: 
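
The idea behind weights reloading can be illustrated on a toy fully-connected layer. This is a minimal sketch with hypothetical function names, not the fpgaconvnet implementation: it repeats the computation once per filter, keeps only that filter's weights in the on-chip buffer, and checks that the result matches the version that stores all weights at once.

```python
# Toy fully-connected layer: out[f] = sum over c of weights[f][c] * x[c].
# All names are hypothetical; this only illustrates the weights-reloading idea.


def run_full(weights, x):
    """Keep every filter on chip: the buffer holds filters * channels words."""
    buffer_words = len(weights) * len(x)
    out = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return out, buffer_words


def run_with_weights_reloading(weights, x):
    """Repeat the computation once per filter, reloading the buffer each pass."""
    out, peak_buffer_words = [], 0
    for row in weights:          # one pass per filter
        on_chip = list(row)      # only this filter's weights are stored
        peak_buffer_words = max(peak_buffer_words, len(on_chip))
        out.append(sum(w * v for w, v in zip(on_chip, x)))
    return out, peak_buffer_words


weights = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, -1]]  # 4 filters, 3 channels
x = [1, 2, 3]

full_out, full_words = run_full(weights, x)
wr_out, wr_words = run_with_weights_reloading(weights, x)
print("same result:", full_out == wr_out)
print("buffer words, full vs reloading:", full_words, wr_words)
```

The output is unchanged, but the peak weight storage drops from `filters * channels` words to `channels` words, at the cost of running one pass per filter.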

After applying [`split_complete`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/partition.py#L242) and [`apply_max_weights_reloading`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/8a2487a2ecf6b59af3352af8ab78a44a1f443f05/fpgaconvnet/optimiser/transforms/weights_reloading.py#L34), we obtain the accelerator configuration with minimal resource utilization, which is referred to as the "resource-minimal" status in the DSE process.

In [11]:
conv_0_layer = net.partitions[0].graph.nodes["Conv_0"]["hw"]

print("coarse_in: ", conv_0_layer.coarse_in)
print("coarse_out: ", conv_0_layer.coarse_out)
print("coarse_group: ", conv_0_layer.coarse_group)
print("fine: ", conv_0_layer.fine)
print("Latency (cycle):", conv_0_layer.latency())
print(conv_0_layer.resource())

coarse_in:  1
coarse_out:  1
coarse_group:  1
fine:  1
Latency (cycle): 27648
{'LUT': 4120, 'FF': 2090, 'DSP': 1, 'BRAM': 1, 'URAM': 0}


The [`apply_random_coarse_node`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/coarse.py#L20) function randomly modifies the `coarse_in`, `coarse_out` and `coarse_group` attributes of the given layer. 

In [12]:
# The transform is random, so repeat until at least one coarse factor is no longer 1.
while conv_0_layer.coarse_in * conv_0_layer.coarse_out * conv_0_layer.coarse_group == 1:
    transforms.apply_random_coarse_node(net.partitions[0], "Conv_0")
    net.partitions[0].update()

print("coarse_in: ", conv_0_layer.coarse_in)
print("coarse_out: ", conv_0_layer.coarse_out)
print("coarse_group: ", conv_0_layer.coarse_group)
print("fine: ", conv_0_layer.fine)
print("Latency (cycle):", conv_0_layer.latency())
print(conv_0_layer.resource())

coarse_in:  3
coarse_out:  1
coarse_group:  1
fine:  1
Latency (cycle): 9216
{'LUT': 6748, 'FF': 3303, 'DSP': 3, 'BRAM': 3, 'URAM': 0}
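
The drop from 27648 to 9216 cycles matches a simple idealised model in which latency shrinks in proportion to the number of parallel streams. The sketch below uses hypothetical helpers, not the optimiser's code, and assumes that candidate coarse factors are the divisors of the channel count (3 input channels for the first VGG16 convolution).

```python
import random

# Toy model of the coarse transform (hypothetical helpers, not the fpgaconvnet API).


def divisors(n):
    """All positive divisors of n, assumed here to be the valid coarse factors."""
    return [d for d in range(1, n + 1) if n % d == 0]


def ideal_latency(base_cycles, coarse_in, coarse_out, coarse_group):
    """Idealised model: cycles shrink by the total number of parallel streams."""
    return base_cycles // (coarse_in * coarse_out * coarse_group)


channels_in = 3       # the first VGG16 convolution takes a 3-channel image
base_cycles = 27648   # latency reported above with all coarse factors at 1

rng = random.Random(0)                          # fixed seed for repeatability
coarse_in = rng.choice(divisors(channels_in))   # either 1 or 3
print("randomly picked coarse_in:", coarse_in)
print("ideal latency for coarse_in=3:", ideal_latency(base_cycles, 3, 1, 1))
```

For `coarse_in = 3` the idealised figure is 9216 cycles, which agrees with the latency reported by the model above.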


The [`apply_complete_fine`](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/fine.py) function maximizes the `fine` attribute of all convolutional layers in the given partition, which is equivalent to fully unrolling their computation in the kernel dimension.

In [13]:
transforms.apply_complete_fine(net.partitions[0])
net.partitions[0].update()

print("coarse_in: ", conv_0_layer.coarse_in)
print("coarse_out: ", conv_0_layer.coarse_out)
print("coarse_group: ", conv_0_layer.coarse_group)
print("fine: ", conv_0_layer.fine)
print("Latency (cycle):", conv_0_layer.latency())
print(conv_0_layer.resource())

coarse_in:  3
coarse_out:  1
coarse_group:  1
fine:  9
Latency (cycle): 1156
{'LUT': 2519, 'FF': 3113, 'DSP': 27, 'BRAM': 3, 'URAM': 0}
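
The DSP counts in the three outputs above (1, 3 and 27) are consistent with a toy model that uses one multiplier per parallel stream and per unrolled kernel position. The sketch below rests on that assumption and on latency scaling inversely with `fine`; the helper names are hypothetical.

```python
# Toy model of kernel-dimension unrolling (hypothetical helpers, not the fpgaconvnet API).


def multiplier_count(coarse_in, coarse_out, coarse_group, fine):
    """Assumption: one multiplier per parallel stream per unrolled kernel position."""
    return coarse_in * coarse_out * coarse_group * fine


def ideal_cycles(base_cycles, fine):
    """Assumption: cycles shrink in proportion to the unrolling factor."""
    return base_cycles // fine


kernel_h, kernel_w = 3, 3
complete_fine = kernel_h * kernel_w  # a fully unrolled 3x3 kernel gives fine = 9

# (coarse_in, coarse_out, coarse_group, fine) for the three configurations above.
configs = [(1, 1, 1, 1), (3, 1, 1, 1), (3, 1, 1, complete_fine)]
for cfg in configs:
    print(cfg, "multipliers:", multiplier_count(*cfg))

print("ideal cycles at complete fine:", ideal_cycles(9216, complete_fine))
```

The idealised figure of 1024 cycles is slightly below the 1156 cycles reported by the model, so the simple division should be read as an optimistic estimate rather than a prediction.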


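Comparing the three outputs shows how resources and latency change as these `transform`s are applied: the latency of `Conv_0` falls from 27648 to 9216 and then to 1156 cycles, while its DSP usage rises from 1 to 3 and then to 27.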