





### High Level Synthesis (HLS)



- ◆ A process of converting high-level description of a design to a netlist
  - > Input:
    - High-level languages (ex: C/C++/System C)
    - Behavioral hardware description languages (ex: Verilog, VHDL)
    - Structural HDLs (ex: Verilog, VHDL)
    - State diagrams & logic networks
  - > Tools:
    - Parser
    - · Library of modules
  - Constraints:
    - · Area constraints (ex: # modules of a certain type)
    - Delay constraints (ex: set of operations should finish in  $\pmb{\lambda}$  clock cycles)
  - > Output:
    - Operation scheduling (time) and binding (resource)
    - · Control generation and detailed interconnections

2021/3/16

Andy Yu-Guang Chen

3



### **Example of High-Level Optimization**



Applying the distributivity law to reduce resource requirement.





## **Internal Representation**



- ◆ Most systems use some form of a data-flow graph (DFG).
  - A DFG may or may not contain information on control flow.
- A data-flow graph is built from
  - Vertices (nodes): representing computation, and
  - Edges: representing precedence relations.





Unit 2





















### **Operation Types**



- ◆ For each operation, define its type
- ◆ For each resource, define a resource type and its delay (in terms of # cycles)
- ◆ T is a relation that maps an operation to a resource type that can implement it
  - > T: V → {1, 2, ...,  $n_{res}$ }.
- ◆ More general case:
  - ➤ A resource type may implement more than one operation type (ex: ALU)
- ◆ Resource binding:
  - Map each operation to a resource with the same type
  - Might have multiple options



2021/3/16

Andy Yu-Guang Chen

15



### Schedule in Spatial Domain



- ◆ Resource sharing
  - More than one operation is bound to same resource
  - Operations have to be serialized
  - Can be represented using hyperedges (define vertex partition)





2021/3/16

Andy Yu-Guang Chen





### HLS Subtasks: Allocation, Scheduling, Assignment

- ◆ Subtasks in high-level synthesis
  - Allocation (Module selection): specify the hardware resources that will be necessary.
  - Scheduling: determine for each operation the time at which it should be performed such that no precedence constraint is violated.
  - Assignment (Binding): map each operation to a specific functional unit and each variable to a register.
- ◆ Remarks:
  - ➤ Though the subproblems are strongly interrelated, they are often solved separately. However, to attain a better solution, an iterative process executing these three subtasks must be performed.



Most scheduling problems are NP-complete ⇒ heuristics are used.





### Scheduling and Binding

- Resource constraints:
  - Number of resource instances of each type  $\{a_k: k=1, 2, ..., n_{res}\}$
- Scheduling:
  - $\triangleright$  Labeled vertices  $\phi(v_3)=1$
- ♦ Binding:
  - $\triangleright$  Hyperedges (or vertex partitions)  $\beta(v_2)$ =adder<sub>1</sub>
- Cost
  - ➤ Number of resources ≈ area Resource dominated
  - Registers, steering logic (mux, bus), wiring, control unit Control dominated
- ◆ Delay:
  - > Start time of the "sink" node
  - Might be affected by steering logic and scheduling (control logic) resource-dominated vs. control-dominated



2021/3/16

Andy Yu-Guang Chen



### **Architectural Optimization**



- Optimization in view of design space flexibility
- ◆ A multi-criteria optimization problem:
  - $\triangleright$  Determine schedule  $\phi$  and binding  $\beta$ .
    - $\triangleright$  Under area A, latency  $\lambda$ , and cycle time  $\tau$  objectives
- ◆ Find non-dominated points in solution space
- ◆ Solution space tradeoff curves:
  - Non-linear, discontinuous
  - > Area, latency, cycle time (more?)
- ◆ Evaluate (estimate) cost functions
- Unconstrained optimization problems for resource dominated circuits:
  - Min area: solve for minimal binding
  - $\triangleright$  Min latency: solve for minimum  $\lambda$  scheduling

2021/3/16

Andy Yu-Guang Chen

19



### Scheduling and Binding



- igspace Cost  $\lambda$  and A determined by both  $\phi$  and  $\beta$ 
  - > Also affected by floorplan and detailed routing
- $\blacklozenge \beta$  affected by  $\phi$ :
  - Resources cannot be shared among concurrent operations
- $\blacklozenge \phi$  affected by  $\beta$ :
  - > Resources cannot be shared among concurrent operations
  - When register and steering logic delays added to execution delays, might violate cycle time
- ♦Order?
  - > Apply either one (scheduling, binding) first



2021/3/16

Andy Yu-Guang Chen



- ◆Hardware is normally partitioned into two parts:
  - **Data path:** a network of functional units, registers, multiplexers and buses.
    - The actual "computation" takes place in the data path.
  - ➤ Control: the part of the hardware that takes care of having the data present at the right place at a specific time, of presenting the right instructions to a programmable unit, etc.
- Often high-level synthesis concentrates on <u>data-</u> path synthesis.
  - ➤ The control part is then realized as a finite state machine or in microcode.



Unit 2

21



## How Is the Datapath Implemented?



- Assuming the following scheduling and binding
  - ➤ Wires between modules?
  - ➤ Input selection?
  - How do binding and scheduling affect congestion?
  - How do binding and scheduling affect steering logic?





2021/3/16

Andy Yu-Guang Chen



### **Operation Scheduling**



- ♦ Input:
  - Sequencing graph G(V, E), with n vertices
  - $\triangleright$  Cycle time  $\tau$
  - $\triangleright$  Operation delays  $D = \{d_i: i=0..n\}$
- ◆ Output:
  - $\triangleright$  Schedule  $\phi$  determines start time  $t_i$  of operation  $v_i$
  - $\triangleright$  Latency  $\lambda = t_n t_0$
- ♦ Goal: determine area & latency tradeoff
- ♦ Issues:
  - Non-hierarchical and unconstrained
  - > Latency constrained
  - Resource constrained
  - > Hierarchical



2021/3/16

Andy Yu-Guang Chen

2



## Min Latency Unconstrained Scheduling



- ◆ Simplest case: no constraints, find minimum latency
- ◆ Given set of vertices *V*, delays *D*, and partial order ≻ on operations *E*, find an integer labeling of operations

$$\phi: V \rightarrow Z^+$$
, such that:

- $> t_i = \phi(v_i)$
- $ightharpoonup t_i \ge t_i + d_i \qquad \forall (v_i, v_i) \in E$
- $\lambda = t_n t_0$  is minimum
- ◆ Solvable in polynomial time
- ◆ Bounds on latency for resource constrained problems
- ◆ <u>ASAP</u> algorithm used: topological order
- Applying the DFG algorithm to finding the longest path between the start and end nodes leads to the latency of the result

2021/3/16

Andy Yu-Guang Chen





















### **Constrained Scheduling**



- Constrained scheduling
  - ➤ General case NP-complete
  - Minimize latency, given constraints on area or the resources (ML-RCS)
  - Minimize resources subject to bound on latency (MR-LCS)
- Exact solution methods
  - > ILP: Integer Linear Programming
  - > Hu's heuristic algorithm for identical processors (operations)
- Heuristics
  - List scheduling
  - > Force-directed scheduling



2021/3/16

Andy Yu-Guang Chen

33



### **List Scheduling**



- ◆ Greedy algorithm for ML-RCS and MR-LCS
  - **▶** Does NOT guarantee optimum solution
- ◆Similar to Hu's algorithm
  - ➤ Operation selection decided by criticality
  - ➤O(n) time complexity
- ◆More general input
  - ➤ Resource constraints on different resource types



2021/3/16

Andy Yu-Guang Chen

```
List Scheduling Algorithm: ML-RCS
LIST_L (G(V,E), a) {
   l = 1
                                                必須知道parent狀態
   repeat {
       for each resource type k {
                                                                     要知道每個type正在進行
             U_{l,k} = available vertices in V \rightarrow "ready state"
                                                                     的數量
             T_{l,k} = operations in progress \rightarrow "ongoing state"
             Select S_k \subseteq U_{l,k} such that |S_k| + |T_{l,k}| \le a_k
             Schedule the S_k operations at step I
       1 = 1 + 1
   \} until v_n is scheduled
  2021/3/16
                             Andy Yu-Guang Chen
```



```
List Scheduling Algorithm: MR-LCS
LIST_R (G(V,E). \lambda') {
   a = 1, = 1
   Compute the ALAP times t
   if t_0^{L} < 0
         return (not feasible
   repeat {
         for each resource type k {
                U_{l,k} = available vertices in V \rightarrow "ready state"
                Compute the slacks \{s_i = \mathbf{t}_i^L - I, \forall v_i \in U_{l,k}\}
                Schedule operations with zero slack, update a
                Schedule additional S_k \subseteq U_{l,k} under a constraints
         }
         1 = 1 + 1
   \} until v_n is scheduled
                                   Andy Yu-Guang Chen
```





### Force-Directed Scheduling



- Similar to list scheduling
  - Can handle ML-RCS and MR-LCS
  - For ML-RCS, schedules step-by-step
  - > BUT, selection of the operations tries to find the globally best set of operations
- Idea time frame:
  - Find the mobility  $\mu_i = t_i^L t_i^S$  of operations
  - Look at the operation type probability distributions
  - > Try to flatten the operation type distributions
- Definition: operation probability density
  - $\triangleright p_i(1) = \Pr \{v_i \text{ starts at step } l\}$

Assume uniform distribution: 
$$p_i(l) = \frac{1}{\mu_i + 1} \quad for \ l \in [t_i^s, t_i^L]$$



2021/3/16

Andy Yu-Guang Chen



#### Force-Directed Scheduling: Definitions



Operation-type distribution (NOT normalized to 1)

$$q_k(l) = \sum_{i:T(v_i)=k} p_i(l)$$

Operation probabilities over control steps:

$$p_i = \{ p_i(0), p_i(1), \dots, p_i(n) \}$$

Distribution graph of type *k* over all steps:

$$\{q_k(0), q_k(1), \dots, q_k(n)\}$$

 $ightharpoonup q_k(\ l\ )$  can be thought of as *expected* operator cost for implementing operations of type k at step l.



2021/3/16

Andy Yu-Guang Chen







#### Self Force



- Sum of forces to feasible schedule steps
- igspace Self-force for operation  $v_i$  in step l

self – force
$$(i,l) = \sum_{m=l_i^S}^{l_i^L} q_k(m) (\delta_{lm} - p_i(m))$$

$$=q_k(l)-\frac{1}{\mu_i+1}\sum_{m=t}^{t_i^L}q_k(m)$$

$$\delta_{lm} = \begin{cases} 1, & \text{if } l = m \\ 0, & \text{if } l \neq m \end{cases}$$





2021/3/16

Andy Yu-Guang Chen





### Self Force Example: $v_6$



- Op v6 can be scheduled in the first two steps
  - p(1) = 0.5; p(2) = 0.5; p(3) = 0; p(4) = 0
  - $\triangleright$  Distribution: q (1) = 2.8; q (2) = 2.3
- ◆ Assign v6 to step 1:
  - $\triangleright$  variation in probability 1 0.5 = 0.5 for step 1
  - $\triangleright$  variation in probability 0 0.5 = -0.5 for step 2
  - ➤ Self-force: 2.8 \* 0.5 2.3 \* 0.5 = + 0.25
  - No successor force
- Assign v6 to step 2:
  - variation in probability 0 0.5 = -0.5 for step 1
  - variation in probability 1 0.5 = 0.5 for step 2
  - Self-force: -2.8 \* 0.5 + 2.3 \* 0.5 = -0.25
  - Successor-force:
  - $\triangleright$  Successor (v7) force is 2.3 \* (0 0.5) + 0.8 \* (1 0.5) = -0.75
  - ➤ Total force = -1
- 1\*0.8 0.5\*(2.3 + 0.8) = -0.75



2021/3/16

Andy Yu-Guang Chen



### Predecessor/successor Force



- ◆ Related to the predecessors/successors
  - Fixing an operation timeframe restricts timeframe of predecessors/successors
  - Ex: Delaying an operation implies delaying its successors

ps-force
$$(i,l) = \frac{1}{\widetilde{\mu}_i + 1} \sum_{m=\widetilde{t}_i^s}^{\widetilde{t}_i^L} q_k(m) - \frac{1}{\mu_i + 1} \sum_{m=t_i^s}^{t_i^L} q_k(m)$$



Andy Yu-Guang Chen







## P/S Force Example: $v_7 \& v_9$

- ◆ Type 1 (v7) distribution:
  - ightharpoonup q(1) = 2.8; q(2) = 2.3; q(3) = 0.8; q(4) = 0
- ◆ Assign v6 to step 2:
  - Time frame of v7 is reduced
  - → 1 \* (0.8) 0.5 \* (2.3 + 0.8) = -0.75
- ◆ Type 2 (v9) distribution:
  - ightharpoonup q(1) = 0.3; q(2) = 1; q(3) = 2; q(4) = 1.6
- ◆ Assign v8 to step 2:
  - > Time frame of v9 is reduced
  - $\triangleright$  0.5 \* (2 + 1.6) 0.3 \* (1 + 2 + 1.6) = 0.3



2021/3/16

Andy Yu-Guang Chen





### Force-Directed Scheduling: Algorithm



- ◆ Very similar to LIST L(G(V,E), a)
  - Compute mobility of operations using ASAP and ALAP
  - Select and schedule operations
  - Go to next control step
- ◆ Difference with list scheduling in selecting operations
  - Compute operation probabilities and type distributions
  - > Select operations with least force
  - Update operation probabilities and type distributions
  - Consider the effect on the type distribution
  - Consider the effect on p/s nodes and their type distributions
  - Complexity: O(n<sup>3</sup>)



2021/3/16

Andy Yu-Guang Chen







#### **Mixed Integer Linear Programming**



- A mathematical programming such that:
  - ➤ The objective is a linear function
  - > All constraints are linear functions
  - ➤ Some variables are real numbers and some are integers, i.e., "mixed integer"
- ◆It is almost like a linear programming, except that some variables are integers

NP-C problem



2021/3/16

Andy Yu-Guang Chen





#### **ILP Formulation of ML-RCS**



- Use binary decision variables
  - $\triangleright$  i = 0, 1, ..., n
  - $> l = 1, 2, ..., \lambda'+1$   $\lambda'$ : given upper-bound on latency
- $x_{i,l} = 1$  if operation i starts at step l, 0 otherwise. Set of linear inequalities (constraints), and an objective function (min latency)
- Observations

$$x_{i,l} = 0 \quad for \quad l < t_i^S \quad and \quad l > t_i^L \quad \text{feasibility}$$

$$(t_i^S = ASAP(v_i), t_i^L = ALAP(v_i))$$

- $t_i$  = start time of op i
- $t_i = \sum_{l} l \cdot x_{i,l}$   $t_i = \text{start time of op } i$   $\sum_{m=l}^{l} x_{i,m} = 1$ If op  $v_i$  takes  $d_i$  steps, is op  $v_i$  (still) executing at step l?



2021/3/16

Andy Yu-Guang Chen

55



#### Start Time vs. Execution Time



- lacktriangle For each operation  $v_i$ , only one start time
- lack If  $d_i = 1$ , then the following questions are the same:
  - $\triangleright$  Does operation  $v_i$  start at step l?
  - $\triangleright$  Is operation  $v_i$  running at step l?
- lacktriangle But if  $d_i > 1$ , then the two questions should be formulated as:
  - $\triangleright$  Does operation  $v_i$  start at step l?
    - Does  $x_{i,l} = I$  hold?
  - $\triangleright$  Is operation  $v_i$  running at step l?
    - Does  $\sum_{m=l-d_i+1}^{l} x_{i,m} = 1 \text{ hold?}$



2021/3/16

Andy Yu-Guang Chen







### ILP Formulation of ML-RCS (Cont.)



- **♦** Constraints:
  - Vinique start times:  $\sum_{l} x_{i,l} = 1$ , i = 0,1,...,n
  - Sequencing (dependency) relations must be satisfied

$$t_i \ge t_j + d_j \ \forall (v_j, v_i) \in E \Rightarrow \sum_l l \cdot x_{i,l} \ge \sum_l l \cdot x_{j,l} + d_j$$

> Resource constraints

$$\sum_{i:T(v_i)=k} \sum_{m=l-d_i+1}^{l} x_{i,m} \le a_k, \quad k = 1, \dots, n_{res}, \quad l = 1, \dots, \overline{\lambda} + 1$$

- lacktriangle Objective: min  $c^T t$ .
  - > t =start times vector, c =cost weight (ex: [0 0 ... 1])
  - When **c** = [0 0 ... 1],  $c^T t = \sum_{l} l \cdot x_{n,l}$



2021/3/16

Andy Yu-Guang Chen







### **ILP Example: Unique Start Times Constraint**

Without using ASAP and ALAP values:

$$x_{1,1} + x_{1,2} + x_{1,3} + x_{1,4} = 1$$
  
 $x_{2,1} + x_{2,2} + x_{2,3} + x_{2,4} = 1$ 

• • •

•••

...

$$x_{11,1} + x_{11,2} + x_{11,3} + x_{11,4} = 1$$

Using ASAP and ALAP:

$$x_{1,1}=1$$

$$x_{2,1} = 1$$

$$x_{3,2} = 1$$

$$x_{4.3} = 1$$

$$x_{5,4} = 1$$

$$x_{6.1} + x_{6.2} = 1$$

$$x_{7,2} + x_{7,3} = 1$$

$$x_{8,1} + x_{8,2} + x_{8,3} = 1$$

$$x_{9,2} + x_{9,3} + x_{9,4} = 1$$

• • •



2021/3/16

Andy Yu-Guang Chen

61





#### **ILP Example: Dependency Constraints**

Using ASAP and ALAP, the non-trivial inequalities are: (assuming unit delay for +

$$2 \cdot x_{7,2} + 3 \cdot x_{7,3} - 1 \cdot x_{6,1} - 2 \cdot x_{6,2} - 1 \ge 0$$

$$2 \cdot x_{9,2} + 3 \cdot x_{9,3} + 4 \cdot x_{9,4} - 1 \cdot x_{8,1} - 2 \cdot x_{8,2} - 3 \cdot x_{8,3} - 1 \ge 0$$

$$2 \cdot x_{11,2} + 3 \cdot x_{11,3} + 4 \cdot x_{11,4} - 1 \cdot x_{10,1} - 2 \cdot x_{10,2} - 3 \cdot x_{10,3} - 1 \ge 0$$

$$4 \cdot x_{5/4} - 2 \cdot x_{7/2} - 3 \cdot x_{7/3} - 1 \ge 0$$

$$5 \cdot x_{n.5} - 2 \cdot x_{9.2} - 3 \cdot x_{9.3} - 4 \cdot x_{9.4} - 1 \ge 0$$

$$5 \cdot x_{n,5} - 2 \cdot x_{11,2} - 3 \cdot x_{11,3} - 4 \cdot x_{11,4} - 1 \ge 0$$



2021/3/16

Andy Yu-Guang Chen





#### **ILP Example: Resource Constraints**

Resource constraints (assuming 2 adders and 2 multipliers)  $x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$ 

$$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$$

$$x_{7,3} + x_{8,3} \le 2$$

$$x_{10.1} \le 2$$

$$x_{9,2} + x_{10,2} + x_{11,2} \le 2$$

$$x_{4,3} + x_{9,3} + x_{10,3} + x_{11,3} \le 2$$

$$x_{5,4} + x_{9,4} + x_{11,4} \le 2$$

- ◆Objective:
  - Since  $\lambda$ =4 and sink has no mobility, any feasible solution is optimum, but we can use the following anyway:  $Min = \frac{1}{2} \cdot x_{n,1} + \frac{1}{2} \cdot x_{n,2} + \frac{1}{4} \cdot x_{n,4}$



2021/3/16

Andy Yu-Guang Chen

63







- ◆ Dual problem to ML-RCS
- ♦Objective:
  - ➤ Goal is to optimize total resource usage, a.
  - ightharpoonup Objective function is  $c^Ta$ , where entries in c are respective area costs of resources
- **◆**Constraints:
  - ➤ Same as ML-RCS constraints, plus:
  - > Latency constraint added:

$$\sum_{l} l \cdot x_{n,l} \le \overline{\lambda} + 1$$

Note: unknown a<sub>k</sub> appears in constraints.



2021/3/16

Andy Yu-Guang Chen







# **Effectiveness of BRT**



| From Synopsys | LITECTIVE   |
|---------------|-------------|
|               | RTL-DESIGN- |

| Design Type  Control | RTL DESIGN: |              | V BEHAVIORAL RETIMING      |              |                           |
|----------------------|-------------|--------------|----------------------------|--------------|---------------------------|
|                      |             | GATES        | SPEED (CLOCE PERIOD) GATES |              | SUMMARY                   |
|                      |             | 10,913 gates | 30.6 ns                    | 11,313 gates | 30% faster, 4% more area  |
| Control              | 23.1 ns     | 3,598-gates  | 19.6-ns                    | 4,575 gates  | 15%-faster, 27%-more area |
| Control              | 28.6 ns     | 3,585-gates  | 28.6 ns                    | 3,359 gates  | same-speed, 6% less-area  |
| Dataflow& Control    | 17-ns       | 28,900-gates | 12.5 ns                    | 30,100 gates | 26%-faster, 4%-more-area  |
| Dataflow& Control    | 16 ns       | 7,620-gates  | 13-ns                      | 8,019-gates  | 20%-faster, 5%-more-area  |
| Dataflow             | 22 ns       | 4,990-gates  | 18.5-ns                    | 5,109 gates  | 16% faster, 2% more area  |
| Dataflow             | 28 ns       | 31,226-gates | 26-ns                      | 32,032 gates | 8% faster, 2% more area   |
| Dataflow             | 26.2-ns     | 14,351-gates | 23.6-ns                    | 13,847 gates | 10% faster, 4%-less area  |
| Dataflow             | 25.9-ns     | 16,798-gates | 20.8-ns                    | 15,550 gates | 20%-faster,-7%-less-area  |
| Dataflow             | 45 ns       | 28,705-gates | 26-ns                      | 30,987 gates | 42%-faster, 8%-more-area  |

 RTL designs have a single clock net and were synthesized to gates using Synopsys Design Compiler.



• Design type: dataflow implies significant number of operators; control implies state machine dominated.



## Summary



- ◆High Level Synthesis (HLS)
- **♦**Scheduling and Binding
- **♦**Operation Scheduling
  - > ASAP Scheduling
  - > ALAP Scheduling
- Constrained Scheduling
  - > Hu's Algorithm
  - ➤ List Scheduling
  - ➤ Force-Directed Scheduling
  - ➤ Linear Programming



2021/3/16

Andy Yu-Guang Chen