CNN Synthesis fail : ERROR: [XFORM 203-504] Stop unrolling loop 'Product1' (firmware/nnet_utils/nnet_dense_latency.h:37) in function 'nnet::dense_latency<ap_fixed<2, 1, (ap_q_mode)5, (ap_o_mode)3, 0>, ap_fixed<2, 1, (ap_q_mode)5, (ap_o_mode)3, 0>, config6>' because it may cause large runtime and excessive memory usage due to increase in code size. Please avoid unrolling the loop or form sub-functions for code in the loop body. ERROR: [HLS 200-70] Pre-synthesis failed. #1013
Comments
This is not a bug - when loops are unrolled, operations are parallelised. However, there is a limit to how much can be executed in parallel (due to available resources, the critical path, and sometimes compiler issues with scheduling). To address this problem, increase the reuse factor. In general with Vivado HLS (hls4ml's Vivado backend) the limit is 4,096 parallel multiplications, so the reuse factor should be chosen large enough that no layer unrolls beyond that. Alternatively, you could also consider using Vitis HLS, which should have better parallelism behaviour (but fully unrolling is still unlikely to work).
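To make the arithmetic concrete, here is a small sketch that finds the smallest reuse factor keeping a layer at or under the 4,096-multiplication limit. min_reuse_factor is a made-up helper for illustration, not an hls4ml API:

```python
UNROLL_LIMIT = 4096  # heuristic Vivado HLS limit on parallel multiplications

def min_reuse_factor(n_mult, limit=UNROLL_LIMIT):
    """Smallest reuse factor that divides n_mult evenly and keeps the
    number of parallel multiplications (n_mult / rf) at or below limit."""
    for rf in range(1, n_mult + 1):
        if n_mult % rf == 0 and n_mult // rf <= limit:
            return rf
    return n_mult  # worst case: fully serial

# e.g. a dense layer with 7,200 multiplications
print(min_reuse_factor(7200))  # -> 2 (7200 / 2 = 3600 <= 4096)
```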
Hi, thank you for your response. However, I encountered this issue with a reuse factor of 1,000,000, using the "Resources" strategy and an approximately 10,000-parameter CNN. If these settings are not viable, what can I do to resolve this issue?
Make sure to use io_stream.
Furthermore, there is a typo - the strategy is Resource, not Resources.
Yes, my bad. I corrected it. Is there a way to estimate the complexity of the process for a network to avoid attempting 'Latency' mode when it is clearly infeasible? A 10K-parameter CNN is relatively small, and I'd like to understand what kind of network would require caution when unrolling and optimizing for performance. |
The HLS will fail automatically when there are more than 4,096 unrolls (i.e. parallel multiplications). This parameter is hard-coded and generally not exposed in hls4ml. It was determined heuristically - the conclusion was that beyond 4,096 the HLS compiler has issues with scheduling and completing the compilation. You can modify this variable in the hls4ml source if needed. So as long as every layer has fewer multiplications than the threshold, you will not get this error. Looking at the error you posted, the last dense layer has 7,200 parameters. In this case you can either increase the reuse factor for that layer so it stays below the threshold, or switch it to the Resource strategy.
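The rule above can be scripted as a pre-synthesis check: for each layer, warn when multiplications divided by the reuse factor exceed the threshold. This is an illustrative sketch (check_layers and the layer sizes are hypothetical, not hls4ml code):

```python
UNROLL_LIMIT = 4096  # heuristic Vivado HLS limit on parallel multiplications

def check_layers(layer_mults, reuse_factor):
    """Return names of layers whose unrolled multiplication count
    (n_mult / reuse_factor) exceeds the HLS scheduling limit."""
    offenders = []
    for name, n_mult in layer_mults.items():
        if n_mult / reuse_factor > UNROLL_LIMIT:
            offenders.append(name)
    return offenders

# hypothetical layer sizes: a conv layer and the 7,200-mult dense layer
layers = {"conv2d_1": 3115008, "dense_1": 7200}
print(check_layers(layers, reuse_factor=1))  # both layers exceed the limit
```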
Thanks for your help!
The reuse factor is capped at the number of parameters for each layer, correct? I have valid reuse factors ranging from 1, 2, ..., to 18432 for a conv2D layer (15,15,32) in (13,13,64) out. However, why is 1 considered a valid reuse factor if the number of parameters in the layer divided by the reuse factor is greater than 4,096? |
Additionally, I'm curious why even with a large reuse factor like 1,000,000 the synthesis fails in "Latency" mode. Wouldn't it be possible to select the maximum reuse factor for each layer instead?
'valid' means it must divide the number of parameters/multiplications in the layer - this ensures QoR (otherwise the last clock cycle would have fewer multiplications than the others, leading to unequal load, unused hardware, scheduling overhead, etc. The idea is that in every clock cycle you do the same number of multiplications: total_layer_mult / layer_reuse_factor). On the other hand, 4,096 is a Vivado HLS concept (not hls4ml), so if you use a different backend, e.g. Quartus targeting Intel boards, it might have a different limit. So when you unroll more than 4,096 the reuse factor is still valid; it just means the HLS compiler (not hls4ml) has issues scheduling such high parallelism.
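In other words, the valid reuse factors are exactly the divisors of the layer's multiplication count. A quick sketch (valid_reuse_factors is an illustrative helper, not part of hls4ml):

```python
def valid_reuse_factors(n_mult):
    """All reuse factors that evenly divide the layer's multiplication
    count, so every clock cycle performs n_mult / rf multiplications."""
    return [rf for rf in range(1, n_mult + 1) if n_mult % rf == 0]

# Conv2D example from the thread: 3x3 kernels, 32 channels in, 64 filters out
# weights = 3 * 3 * 32 * 64 = 18,432, so the reuse factor must divide 18,432
rfs = valid_reuse_factors(18432)
print(rfs[:8])  # -> [1, 2, 3, 4, 6, 8, 9, 12]
```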
Latency has a different implementation. All the loops are unrolled, but the number of multiplications is limited using another pragma. Therefore, the design is scheduled such that it completes in reuse_factor clock cycles (plus some constant offset), while the loops are still treated as fully unrolled in the HLS code.
Thank you! Can you provide more details on how the Latency mode works? I read the paper and the documentation but didn't find the information I was looking for. Is there a place where I can find answers without having to delve deeply into the code? |
Unfortunately, there is no more documentation than the tutorials, website and papers. Please have a look here: https://github.com/fastmachinelearning/hls4ml/blob/main/hls4ml/templates/vivado/nnet_utils/nnet_dense_latency.h. The code is short, and the most important part of it is the ALLOCATION pragma. In the Latency strategy, all model weights are stored in registers and all the loops (there are only 2) are fully unrolled. However, the implementation still takes more than one clock cycle, determined by the reuse factor. Typically the latency is reuse_factor clock cycles + some overhead (invoking the layer and adding the bias, typically 2 clock cycles). In this case, the number of parallel multiplications is limited through the ALLOCATION pragma rather than by partially unrolling the loops. Importantly: both convolutional and recurrent layers are implemented using (require) matrix multiplication, so they build on top of the dense layers (which are essentially matrix multiplication).
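Following that description, the per-cycle multiplier budget enforced in the Latency strategy can be sketched as total multiplications divided by the reuse factor. The helper below is illustrative (the exact formula in the hls4ml source may differ):

```python
import math

def multiplier_limit(n_in, n_out, reuse_factor):
    """Approximate number of multiplier instances allowed per clock cycle
    in the Latency strategy: total multiplications / reuse factor."""
    return math.ceil((n_in * n_out) / reuse_factor)

# a 72x100 dense layer (7,200 multiplications) with reuse factor 2
print(multiplier_limit(72, 100, 2))  # -> 3600 multipliers active per cycle
```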
I am working on a script designed to forecast FPGA resource usage (specifically DSPs and BRAMs) for hls4ml. The goal is to estimate resource usage and throughput without synthesizing every model in HLS. My script also checks each layer to see whether the number of parameters and multiply-accumulate operations (MACCs) divided by the reuse factor exceeds 4,096, issuing a warning if this threshold is surpassed. However, when I run this script on CNNs with 10k, 100k, and 1M parameters that fail to synthesize due to extensive unrolling, I do not receive any warnings. The only successful synthesis method I've found is using IO streams and resource constraints. Can you help me understand what I might be missing in my script that causes it to miss these warnings? I would also appreciate any suggestions or improvements to the script. @vloncar @bo3z I know you have already answered similar questions - if you have time to look at this, it would be pretty awesome :)
import argparse
import json
import numpy as np
import torch
from tensorflow.keras.models import model_from_json, load_model
from tensorflow.keras.layers import Dense, Conv2D, Flatten, LSTM, BatchNormalization, GRU, Activation, MaxPooling2D
import math
from tabulate import tabulate
from colorama import Fore, Style
def format_table_with_colors(data):
    """
    Formats and colors the data table for better visibility when displayed.
    Uses Colorama library colors to highlight keys.
    """
    formatted_data = []
    for key, value in data.items():
        formatted_data.append([Fore.GREEN + key + Style.RESET_ALL, value])
    return formatted_data
# Configuration dictionary to handle all global settings
config = {
    "freq_mhz": 250,
    "DSP_max": 2520,
    "DSP_used": 0,
    "reuse_factor": 2,
    "mem_bandwidth_gbps": 10,
    "BRAM_capacity_bits": 1024 * 36,
    "BRAM_used": 0,
    "BRAM_max": 32.1 * 10**6,
    "weight_bits": 32,
    "info_layer": {},
    "throughput": float('inf'),
    "batch_size": 32,
    "io_type": "stream"
}
def estimations_keras(model):
    """
    Estimates FPGA resource needs and performance metrics for each layer in a Keras model.
    Calculates DSP and BRAM requirements and adjusts throughput based on resource utilization.
    """
    total_macc = sum(macc_per_layer_keras(layer) for layer in model.layers)
    for layer in model.layers:
        if config["DSP_used"] < config["DSP_max"]:
            DSP_needed = math.ceil(config["info_layer"][layer.name]['macc'] / config["reuse_factor"])
            config["DSP_used"] += DSP_needed
            if config["info_layer"][layer.name]['macc'] != 0:
                new_throughput = math.floor(DSP_needed * config["freq_mhz"] * 10**6 / config["info_layer"][layer.name]['macc'])
                if config["throughput"] > new_throughput:
                    config["throughput"] = new_throughput
        if config["BRAM_used"] < config["BRAM_max"] / config["BRAM_capacity_bits"]:
            config["BRAM_used"] += math.ceil(config["info_layer"][layer.name]['params'] * config["weight_bits"] / config["BRAM_capacity_bits"])
            if config["io_type"] == "stream":
                config["BRAM_used"] += math.ceil(np.prod(layer.output.shape[1:3]) * config["weight_bits"] / (config["BRAM_capacity_bits"] * config["reuse_factor"]))
    formatted_table = format_table_with_colors({'Throughput': config["throughput"], 'DSP': config["DSP_used"], 'BRAM': config["BRAM_used"]})
    print(tabulate(formatted_table))
def macc_per_layer_keras(layer):
    """
    Calculates the multiply-accumulate operations (MACCs) for each type of layer in a Keras model.
    """
    if isinstance(layer, (Conv2D, Dense, LSTM, GRU, BatchNormalization)):
        return layer_specific_maccs(layer)
    elif isinstance(layer, MaxPooling2D):
        config["info_layer"][layer.name] = {'macc': 0, 'params': 0}
        return 0
    elif isinstance(layer, Activation) or "activation" in layer.__class__.__name__.lower():
        return estimate_activation_maccs(layer)
    elif isinstance(layer, Flatten):
        config["info_layer"][layer.name] = {'macc': 0, 'params': 0}
        return 0
    # Record unhandled layer types so later lookups do not raise a KeyError
    config["info_layer"].setdefault(layer.name, {'macc': 0, 'params': 0})
    return 0
def layer_specific_maccs(layer):
    """
    Detail specific calculations per layer type, including Conv2D, Dense, LSTM, GRU, BatchNormalization.
    """
    output_area = np.prod(layer.output.shape[1:3])
    if isinstance(layer, Conv2D):
        # Each output pixel needs kernel_area * input_channels MACCs per filter
        macc_per_output = np.prod(layer.kernel_size) * layer.input.shape[-1] * layer.filters
    elif isinstance(layer, Dense):
        # Each output neuron performs one MACC per input feature
        macc_per_output = layer.input.shape[-1]
    elif isinstance(layer, (LSTM, GRU)):
        input_dims = layer.input.shape[-1]
        num_units = layer.units
        macc_per_output = 4 * num_units if isinstance(layer, LSTM) else 3 * num_units
        macc_per_output *= (input_dims + num_units + 1)  # +1 for bias
    elif isinstance(layer, BatchNormalization):
        # Normalize and scale; record the layer so later lookups succeed
        config["info_layer"][layer.name] = {'macc': 2 * output_area, 'params': layer.count_params()}
        return 2 * output_area
    total_macc = output_area * macc_per_output
    if layer.use_bias:
        total_macc += output_area
    config["info_layer"][layer.name] = {'macc': total_macc, 'params': layer.count_params()}
    return total_macc
def estimate_activation_maccs(layer):
    """
    Estimates the MACCs for activation functions based on their complexity.
    """
    output_elements = np.prod(layer.output.shape[1:])
    if layer.activation.__name__ in ['softmax', 'sigmoid', 'tanh']:
        macc = 5 * output_elements  # Approximation for expensive operations
    else:
        macc = 0  # relu and other cheap element-wise activations
    # Record the layer so later lookups do not raise a KeyError
    config["info_layer"][layer.name] = {'macc': macc, 'params': 0}
    return macc
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Estimate performance metrics for a Keras or PyTorch model on FPGA")
    parser.add_argument("--model_file", help="Path to the model file (JSON + H5)")
    parser.add_argument("--framework", choices=["keras", "pytorch"], default="keras", help="Choose the framework (Keras or PyTorch)")
    parser.add_argument("--batch_size", type=int, default=32, help="Batch size for processing")
    parser.add_argument("--quantization", type=int, default=32, help="Number of bits needed to represent a weight")
    parser.add_argument("--reuse_factor", type=int, default=2, help="Number of DSP reuses")
    parser.add_argument("--frequency", type=int, default=250, help="Operating frequency in MHz")
    parser.add_argument("--io_type", default="stream", help="IO type: stream or parallel")
    args = parser.parse_args()
    # Example of how to load and use the model depending on the framework
    if args.framework == "keras":
        model = model_from_json(open(args.model_file + ".json").read())
        try:
            model.load_weights(args.model_file + ".h5")
        except OSError:
            model.load_weights(args.model_file + "_weights.h5")
        model.summary()
        config.update({
            "batch_size": args.batch_size,
            "reuse_factor": args.reuse_factor,
            "freq_mhz": args.frequency,
            "weight_bits": args.quantization,
            "io_type": args.io_type
        })
        estimations_keras(model)
    else:
        # Placeholder for PyTorch functionality
        print("PyTorch estimation functionality not implemented.")
Environment
For all the pip dependencies, refer to the env.txt file.
env.txt
Quick Summary
I attempted to synthesize some CNN models and all attempts failed. After removing layers one by one, I identified Conv2D as the problematic layer due to excessive memory usage and large runtime. MLPs, however, work perfectly. I attempted synthesis with all NN configurations using the following scripts.
Test_sript.txt
example.py.txt
I hope the conversion from .py or .yaml to .txt will not create too many artefacts and bugs.
Details
Steps to Reproduce
Warning: All provided files are in .txt format; you will need to change the extension to .py or .yml for the process to work. Also, remember to modify the path to your own file locations.
To replicate the bug, follow these steps:
Step 1: Install the dependencies from env.txt
Step 2: Rename Test_script.txt to Test_script.py and execute it to create the models to be synthesized
Step 3: Update the yaml.txt configuration file to include the CNN model (H5 + JSON) you wish to synthesize
Step 4: Rename example.py.txt to example.py and run it to attempt synthesis
Expected behavior
It should synthesize the CNN.
Actual behavior
Possible fix
I saw someone talking about hls4ml version 0.3.0 who claimed to have solved a similar problem, but I didn't try it because I expect this repository to be stable.