# Model Network Partition

In this section, let us have a more detailed look at the `Network` and `Partition` classes.

In [1]:
from fpgaconvnet.parser.Parser import Parser

onnx_path = "../3.1_model_onnx_parser/fp16/vgg16_bn.onnx"
parser = Parser(custom_onnx=True, batch_size=1)
net = parser.onnx_to_fpgaconvnet(onnx_path)



To obtain a performance estimate for the network, run the following code. Here we assume a single FPGA device clocked at 200 MHz.

In [2]:
print("Number of partitions:", len(net.partitions))

multi_device = False
freq = 200  # MHz
reconf_time = 0.08255  # seconds

cycles = int(net.get_cycle(multi_device))
latency = net.get_latency(freq, multi_device, reconf_time)
throughput = net.get_throughput(freq, multi_device, reconf_time)

print("Cycles: ", cycles)
print("Latency (seconds): ", latency)
print("Throughput (frames per second): ", throughput)

Number of partitions: 1
Cycles:  199901582
Latency (seconds):  0.9995079149121093
Throughput (frames per second):  1.0004923273548403


Hmm... The design seems a bit slow, but don't worry: this is just an example to show you how to obtain these estimates at the network level. In fact, for the `net` object generated by `Parser`, the computation inside each layer is set to fully sequential mode. We'll show how to improve the performance later in the tutorial.
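As a quick sanity check, the reported latency and throughput can be reproduced from the cycle count and the clock frequency alone. The sketch below uses only the numbers printed above and plain arithmetic. It does not call the library, and it assumes a single partition with `batch_size=1`, as in this example.

```python
# Values copied from the printed output above.
cycles = 199_901_582
freq_mhz = 200

# One clock cycle lasts 1 / (freq_mhz * 1e6) seconds.
latency_s = cycles / (freq_mhz * 1e6)

# With one frame per run, throughput is the reciprocal of the latency.
throughput_fps = 1 / latency_s

print(f"Latency (seconds): {latency_s:.4f}")
print(f"Throughput (frames per second): {throughput_fps:.4f}")
```

Both values agree with the estimates printed by `net.get_latency` and `net.get_throughput` to several decimal places. The tiny difference comes from the cycle count being truncated by `int(...)` before printing.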

Looking at the [`net.get_cycle`](https://github.com/AlexMontgomerie/fpgaconvnet-model/blob/dev-petros/fpgaconvnet/models/network/Network.py#L121) function, since we are scheduling partitions sequentially on a single device, the total network cycle count is the sum of the cycle counts of the individual partitions.
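The idea of summing per-partition cycles can be sketched in a few lines of plain Python. The per-partition cycle counts below are hypothetical, and the helper names `network_cycles` and `network_latency` are made up for illustration. How reconfiguration time is charged (here, once between consecutive partitions) is an assumption and is not taken from the library source.

```python
def network_cycles(partition_cycles):
    """Total cycles when partitions run one after another on one device."""
    return sum(partition_cycles)


def network_latency(partition_cycles, freq_mhz, reconf_time_s):
    """Latency in seconds: compute time plus reconfiguration overhead."""
    compute_s = network_cycles(partition_cycles) / (freq_mhz * 1e6)
    # Assumption: one reconfiguration between each pair of consecutive partitions.
    reconf_s = max(len(partition_cycles) - 1, 0) * reconf_time_s
    return compute_s + reconf_s


# Hypothetical cycle counts for a network split into three partitions.
example_cycles = [120_000_000, 50_000_000, 30_000_000]

total = network_cycles(example_cycles)
latency = network_latency(example_cycles, freq_mhz=200, reconf_time_s=0.08255)

print("Total cycles:", total)
print(f"Latency (seconds): {latency:.4f}")
```

With a single partition the reconfiguration term vanishes under this assumption, which is consistent with the latency printed earlier being the cycle count divided by the clock frequency.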

On the other hand, to obtain the resource estimate, we can use the following code:

In [3]:
for i, partition in enumerate(net.partitions):
    print(f"{i}, resource: ", partition.get_resource_usage())

0, resource:  {'FF': 38956, 'LUT': 63555, 'DSP': 17, 'BRAM': 15002, 'URAM': 0}


Oh! That's a lot of BRAM. Why is that? In the default setup, a streaming dataflow accelerator keeps all the weights on-chip all the time. The VGG16 model we are looking at has around 15M parameters. When the model is quantized to W16A16, the weights require roughly 244 Mb of memory. Given that each Xilinx BRAM has a capacity of 18 Kb, 244 Mb / 18 Kb ≈ 13,000 BRAMs, plus some overhead elsewhere.
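The back-of-the-envelope estimate above can be written as a small helper. This is a naive lower bound rather than the model fpgaConvNet actually uses: it assumes the weights pack perfectly into 18 Kb blocks (18 × 1024 bits each) and ignores width/depth constraints and any non-weight buffers. The function name `weight_brams` and the round 15M parameter count are illustrative.

```python
import math


def weight_brams(num_params, weight_bits, bram_bits=18 * 1024):
    """Minimum number of 18 Kb BRAMs needed to hold all weights on-chip."""
    total_bits = num_params * weight_bits
    return math.ceil(total_bits / bram_bits)


# Roughly 15M parameters at 16 bits per weight.
print("Estimated BRAMs for weights:", weight_brams(15_000_000, 16))
```

The result lands at about 13,000 blocks, so the weights alone account for most of the 15002 BRAMs reported above.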

Again, there is no need to worry about this at the moment. We can either use a smaller model or, from a hardware perspective, rely on fpgaConvNet's support for [device reconfiguration](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/partition.py), [weights reloading](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/transforms/weights_reloading.py) and [weights streaming](https://github.com/AlexMontgomerie/fpgaconvnet-optimiser/blob/dev-petros/fpgaconvnet/optimiser/solvers/greedy_partition.py#L366) to deal with this problem. We will talk about them later.

In addition, changing the precision of the model also makes a difference. Now let's load the ONNX model that is annotated with BFP8 quantization.

In [4]:
onnx_path = "../3.1_model_onnx_parser/bfp8/vgg16_bn.onnx"
parser = Parser(custom_onnx=True, batch_size=1)
net = parser.onnx_to_fpgaconvnet(onnx_path)
for i, partition in enumerate(net.partitions):
    print(f"{i}, resource: ", partition.get_resource_usage())

0, resource:  {'FF': 33187, 'LUT': 48492, 'DSP': 31.5, 'BRAM': 7556, 'URAM': 0}
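To see the effect of the precision change at a glance, the two resource reports can be compared entry by entry. The sketch below copies the two dictionaries printed in this section. The helper `resource_ratio` is a hypothetical convenience function, not part of fpgaConvNet.

```python
# Resource reports copied from the outputs above.
res_16bit = {'FF': 38956, 'LUT': 63555, 'DSP': 17, 'BRAM': 15002, 'URAM': 0}
res_bfp8 = {'FF': 33187, 'LUT': 48492, 'DSP': 31.5, 'BRAM': 7556, 'URAM': 0}


def resource_ratio(new, old):
    """Ratio new/old per resource type; None where the old value is zero."""
    return {name: (new[name] / old[name] if old[name] else None) for name in old}


ratios = resource_ratio(res_bfp8, res_16bit)
for name, ratio in ratios.items():
    print(name, "n/a" if ratio is None else f"{ratio:.2f}")
```

BRAM usage drops to roughly half, in line with the weights shrinking from 16 to about 8 bits each, while the DSP estimate goes up.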
