# 專題書面報告格式規定

- ■專題書面報告需包含:
  - (1). 專題名稱、組長組員學號姓名(內容中不得揭露指導教授姓名)
  - (2). 摘要
  - (3).內文:專題研究動機與目的、現有相關研究概況及比較、設計原理、研究方法 與步驟、系統實現與實驗、專題重要貢獻、團隊合作方式、效能評估與成果、 結論......等
  - (4). 參考文獻

### ● 文件編排格式建議,提供參考:

- 1. 請使用 A4(21cm x 29.7cm), Microsoft Word 7.0 以上編輯。
- 2. 版面設定:

上邊界及下邊界: 2.54cm, 左邊界及右邊界: 3.17cm。頁碼置於下方置中。

3. 內文文字:

中文請用新細明體 12 點字體,標點符號請用全型,英文、數字請用 Times New Roman 12 點字體,單行間距,由左向右打字。

4. 圖表之處理:

圖表置於正文內。<u>表的名稱置於表上方,圖的名稱置於圖下方</u>,並以阿拉伯數字(如表 1、表 2、表 3、圖 1、圖 2、圖 3 等)依序編號。

對圖、表內容 (如表中之符號) 做簡要說明時,請置於圖、表下方。

5. 參考文獻:

文獻部份請將中文列於前,英文列於依序編號後。

#### 補充:

要做目錄嗎?

=>摘要及目錄,請以i,ii,iii 等小寫羅馬數字連續編頁。 本文及附錄,以1,2,3等阿拉伯數字連續編頁。

參考文獻沒中文的話?

=>本章請列出報告中所引用或參考之文獻,包含文獻名稱、作者、書名或刊登的期刊、出版日期或卷數、在期刊中的頁數等必要資訊。

在報告中,應在引用或參考的段落後註明引用的文獻編號,以方括號表示之,如 [2]、[3]表示引用或參考第2項、第3項文獻。

- [1] N. Mohan, T. M. Undeland, and W. P. Robbins, Poewr Electronic, Converters, Applications and Design, Wiley, New York, USA, 1995.
- [2] Y. Y. Tzou, "DSP-based Fully Digital of a PWM DC-AC Converter for AC Voltage Regulation," IEEE Power Electronics Specialists Conference, Atlanta, USA, June 1995, pp. 138-144.
- [3] A. V. Jouanne, P. N. Enjeti, and D. J. Lucas, "DSP Control of High Power UPS Systems Feeding Nolinear Loads," IEEE Transactions on Industrial Electronics, Vol. 43, No. 1, Feb 1996, pp. 121-125.

# 國立清華大學資訊工程系 110 學年度上學期專題報告

| 專題名稱        | Verilog Simulation Optimization via Instruction Reduction |           |          |        |               |  |  |
|-------------|-----------------------------------------------------------|-----------|----------|--------|---------------|--|--|
| 參加競賽<br>或計畫 | ■ 參加對外競賽                                                  |           | □ 參與其他計畫 |        | □無參加對外競賽或任何計畫 |  |  |
| 學號          | 107062134                                                 | 107000118 |          | 107060 | 024           |  |  |
| 姓名          | 樊明膀                                                       |           | 劉家豪      | 許鎧博    |               |  |  |

# 摘要

Since the complexity of a SOC increases exponentially, functional verification becomes a time-consuming problem in the design cycle. To reduce the main reason contributing to the low simulation efficiency on CPU time, contestants are required to decrease the number of continuous assignments and bit–selects.

This objective is to facilitate the simulator by a parser, which performs Verilog-to-Verilog transformation while minimizing instructions. The optimized Verilog file must be syntactically correct, inviolate the given limitations, and be the functional equivalent to the input Verilog file.

Under the constraints and regulations, this project provides a different data structure to save the parsing instructions rather than traditional graphs. Our data structure having the good property to represent the circuit directly without any transformation is constructed to implement the logic optimization. This reduces the complexity during the optimization stage, contributing to exploring potential optimization strategies on the following optimization. In addition, to implement traditional optimization methods, this project provides novel strategies to further improve simulation efficiency.

As the result, one of these new approaches can get a score ten times better than before, the instructions in one testcase can be reduced to one over seventy-five compared to the original one. Even some testcases can be simplified to the simplest form of them by our work. All of the members in this project participate in the CAD contest, but each of us forms a single-member team. **Two of the members won the ICCAD awards, one is the first prize and the other is fourth prize.** 

# **Table of contents**

- 1. Introduction
- 2. Objective & Attribution
- 3. Problem Formulation
  - 3.1 Program Requirement
  - 3.2 Original Verilog Design
  - 3.3 Output Optimized Verilog Design
  - 3.4 Verification of Simulation Correctness
  - 3.5 Evaluation
- 4. Methodology Compared to Current Research
- 5. Implementation
  - **5.1 Data Structure**
  - **5.2** Basic Logic Synthesis And Basic Logic Rules
  - **5.3 Substitution**
  - **5.4 Pattern Comparison**
  - **5.5 Commutative Law**
  - **5.6 Further Pattern Comparison**
  - 5.7 SAT Solver Based Redundant Wire Checking
  - 5.8 The usage of xformtmp
  - **5.9 Gate Merging**
  - 5.10 Output Merging
- 6. Result
- 7. Conclusion
- 8. Reference

## 1. Introduction

Since the size and complexity of a SOC keep growing exponentially to satisfy various applications, functional verification gradually becomes the most time-consuming part of the whole design cycle [1]. How to perform functional verification in a stringent time-to-market window is undoubtedly a critical problem in the EDA field.

The simulator is an essential tool for functional verification from RTL to gate-level, whose performance is directly related to the IC development progress. However, when the design size exceeds a certain amount, the simulation time becomes unacceptable. IC designers may spend several weeks in gate-level simulation with millions of gates in RTL [2]. In other words, the simulator performance enhancement is beneficial for the whole industry.

The objective of this contest topic, Verilog Simulation Optimization via Instruction Reduction [3], is to enhance the performance of the simulator from the algorithm point of view. The contestants are asked to write a program performing Verilog-to-Verilog transformation to minimize instructions/assignments for performance consideration. The generated Verilog file must be syntactically correct, inviolate the given limitations, and be functional equivalent to the input file, then simulation efficiency will be evaluated..

# 2. Objective & Attribution

This project focuses on facilitating Verilog simulation by decreasing the number of instructions. Without a doubt, keeping the correctness of output is a necessity. Based on the efficiency of Verilog simulation is proportional to the CPU time spent on instructions execution. Simply put, if the instructions are reduced, the simulation efficiency grows.

In this problem, contestants are asked to write a C++ parser to convert a given Verilog design into an optimized Verilog design. This source-to-source transformation only needs to be considered in continuous assignments, bit/part-selects of vectors, bit operations, and wire type signals to further optimize for the given Verilog design. Consequently, the objective is to minimize the total number of continuous assignments and vector bit/part-selects in the given Verilog design.

## 3. Problem Formulation

As mentioned above, the contestants are required to develop a C++ program to optimize the input Verilog file. The requested program "verilogopt" takes a Verilog design "original.v" as an input and outputs the optimized Verilog design "optimized.v". Besides, several restrictions need to be obeyed during the optimization. Only when passing the verification of simulation correctness, the program can be evaluated by the

testcases with 2-state input values, 0 & 1. The following requirement is referenced from "Verilog Simulation Optimization via Instruction Reduction".

### 3.1. Program Requirement

Implementation of a Verilog parser is necessary to this problem, any third-party opensource are not allowed in the program. The input Verilog file" original.v" uses very restrictive Verilog syntax to confirm the consistency when parsing. However, the output Verilog file "optimized.v" provides flexible syntax to create innovation optimization methods.

## 3.2. Original Verilog Design

The input of this problem is a given Verilog file named "original.v", which contains two modules: dut and tb. The module dut is the main module to be optimized and the module tb is merely an auxiliary module for checking the simulation correctness. There are only two packed ports, out and in, which are by default 2-state wires, 0 or 1 as follow:

```
module dut (out, in);
  output[SIZEOUT:0] out;
  input[SIZEIN:0] in;
```

In addition to the ports, a bunch of 1-bit wire signals is declared locally in the module dut as temporary variables as the below example.

```
wire origtmp1;
wire origtmp2;
...
wire origtmpN;
```

The following expressions of ports and local wires are all continuous assignments, to avoid confusion, Backus-Naur Form (BNF) is adopted.

```
assign < LHS > = < RHS >;
```

<LHS> is either a bit-select on port-out or a temporary wire: out[#] or origtmp# (where # denotes constant).

```
<LHS> ::=
out[#]
| origtmp#
```

<RHS> is either one item, one item with the unary operation, or a binary operation with two items.

<ITEM> can be either a 4-state constant, a bit-select on port out, a bit-select on a port in, or a temporary wire.

<UNARY OP> is the bitwise negation operator.

<BINARY OP> is one of the following binary bitwise operators.

Consequently, after considering all regulations above, here is an example of input Verilog file.

```
module dut (out, in);
    output[3:0] out;
    input[15:0] in;
    wire origtmp1;
    assign origtmp1 = 1'b0;
    assign out[2] = in[0] | origtmp1;
    assign out[3] = origtmp1 | 1'b0;
endmodule
```

```
module tb();
    reg[3:0] results;are
    reg[15:0] feed[1];
    dut duttest(results, feed[0]);
    initial begin
    $readmemb("data.txt", feed);
    $display(results);
    end
endmodule
```

## 3.3. Output – Optimized Verilog Design

The output Verilog file "optimized.v" also contains two modules: dut and tb. The module dut represents the optimized instructions reduced by contestants, the module tb needs to be identical to that in "original.v". To find innovation optimization strategies, a little flexible syntax is allowed in this problem. The followings are the regulations to syntax:

The declaration of extra temporary wires is permitted for a single or multiple-bit wide. However, multiple bits are required to be one-dimensional and packed with either ascending or descending indices.

```
// 1-bit → OK
wire xformtmp1;
// packed 4-bit, descending indices from 3 to 0 → OK
wire[3:0] xformtmp2;
// packed 7-bit, ascending indices from 0 to 6 → OK
wire[0:6] xformtmp3;
// unpacked 2-bit → NOT allowed
wire xformtmp4[1:0];
// packed 2 or more dimensions → NOT allowed
wire[7:0][1:0] xformtmp5;
```

<LHS> allows 2 more types: xformtmp# and out[MSB:LSB]. In Verilog syntax, myvect[MSB:LSB] denotes 'part select' of vector myvect, which means consecutive bits from index MSB to index LSB (width equals to MSB-LSB+1). According to IEEE standard, though ascending or descending indices are both fine, every use must be consistent with its declaration.

<RHS> is the same as input Verilog.

<ITEM> allows four extra types: out[MSB:LSB], in[MSB:LSB], xformtmp#, and xformtmp#[MSB:LSB]. Additionally, it could be a N-bit 2-state constant.

<UNARY OP> is the same as input Verilog.

<BINARY OP> is the same as input Verilog.

### 3.4. Verification of Simulation Correctness

A given input file "data.txt" which contains a one-line text string of characters: 0, 1 is used to verify the correctness of simulation. For instance, 0111100100011001 may the data to read in is for 'reg[15:0] data[1]'. Simulation results, i.e., the value of out-port vectors are dumped via \$writememb. Simulation correctness is checked by comparing the \$writememb results of "original.v" and "optimized.v". In this project, we use ABC: A System for Sequential Synthesis and verification as the formal verification tool when facing the small circuit. For big circuit, we only can use the Synopsys VCS tool stimulate our circuit and compare the result.

Suppose we have the following "original.v".

```
module dut (out, in);
  output[3:0] out;
  input[15:0] in;
```

```
wire origtmp1;
    assign origtmp1 = 1'b0;
    assign out[2] = in[0] | origtmp1;
    assign out[3] = in[1] | 1'b0;
endmodule
```

• The "original.v" contains 3 continuous assignments of original, out[2], out[3] and 5 bit-selects of 1'b0, in[0], original, in[1], 1'b0.

Then we may derive the following 3 versions of valid "optimized.v", which represents the different degrees of the optimization.

```
Version 1:

module dut (out, in);
output[3:0] out;
input[15:0] in;
assign out[2] = in[0] | 1'b0;
assign out[3] = in[1] | 1'b0;
endmodule
```

• The "optimized.v" contains 2 continuous assignments and 4 bit/part selects: out[2], in[0], out[3], in[1].

```
Version 2:

module dut (out, in);
output[3:0] out;
input[15:0] in;
wire[1:0] xformtmp1;
assign xformtmp1 = 2'b0;
assign out[3:2] = in[1:0] | xformtmp1;
endmodule
```

• The "optimized.v" contains 2 continuous assignments and 2 bit/part selects: out[3:2], in[1:0].

```
Version 3:

module dut (out, in);

output[3:0] out;

input[15:0] in;

assign out[3:2] = in[1:0];

endmodule
```

• The "optimized.v" contains 1 continuous assignments and 4 bit/part selects: out[3:2], in[1:0].

### 3.5. Evaluation

For each "optimized.v", a final score will be concluded from simulation correctness, optimization regulation, execution time limit, and simulation efficiency. Simulation correctness, execution time limit, and optimization regulation are priority requirements. If any of them is violated, simulation efficiency will NOT be evaluated any further due to the content of that matrix being set to all zero.

Simulation correctness is evaluated by CSCORE, if any differences are found to the "original.v", then CSCORE will be 0, whereas CSCORE is set to 1.

```
• CSCORE = 1 (if no diffs)
```

• CSCORE = 0 (if any diffs are found)

Optimization regulation is represented by RSCORE, if any regulation is violated, then Rscore will be 0, vice versa.

- RSCORE = 1 (if all regulations are followed)
- RSCORE = 0 (if any regulation is violated)

Execution time limit is set as 600 seconds that cannot be exceeded for each testcase.

- TSCORE = 1 (if the program executes 600sec or less)
- TSCORE = 0 (if the program executes longer than 600sec)

Simulation efficiency is measured with the total number of two types of assignments in the module dut in "optimized.v" as follows:

- continuous assignments (denoted as <COUNT ASGN>)
- bit/part-selects appearing in all continuous assignments (denoted as <COUNT SELS>)
- ESCORE = (<OUT\_PORT\_WIDTH> + <IN\_PORT\_WIDTH>) / (<COUNT\_ASGN> + <COUNT\_SELS>)

The number of (<OUT\_PORT\_WIDTH> + <IN\_PORT\_WIDTH>) is fixed in each "original.v". The lower the number of (<COUNT\_ASGN> + <COUNT\_SELS>) indicating fewer CPU instructions needed, the higher the ESCORE is. And FSCORE represents the final score of each testcase like the following.

```
FSCORE = CSCORE * RSCORE * TSCORE * ESCORE
```

To show how various optimization methods influence testcases with different scores, there are three versions of "optimized.v" in the following:

```
Version 1:

module dut (out, in);
output[3:0] out;
input[15:0] in;
assign out[2] = in[0] | 1'b0;
assign out[3] = in[1] | 1'b0;
endmodule
```

• The "optimized.v" contains 2 continuous assignments and 4 bit/part selects: out[2], in[0], out[3], in[1]. <COUNT\_ASGN> + <COUNT\_SELS> = 2 + 4 = 6.

```
Version 2:

module dut (out, in);

output[3:0] out;

input[15:0] in;

wire[1:0] xformtmp1;

assign xformtmp1 = 2'b0;

assign out[3:2] = in[1:0] | xformtmp1;

endmodule
```

• This "optimized.v" contains 2 continuous assignments and 2 bit/part selects: out[3:2], in[1:0]. <COUNT ASGN> + <COUNT SELS> = 2 + 2 = 4.

```
Version 3:

module dut (out, in);

output[3:0] out;

input[15:0] in;

assign out[3:2] = in[1:0];

endmodule
```

• The "optimized.v" contains 1 continuous assignments and 2 bit/part selects: out[3:2], in[1:0]. <COUNT\_ASGN> + <COUNT\_SELS> = 1 + 2 = 3.

As the number of (<OUT\_PORT\_WIDTH> + <IN\_PORT\_WIDTH>) is fixed, the version 3 of all "optimized.v" leads to the smallest (<COUNT\_ASGN> + <COUNT\_SELS>) and highest FSCORE, 6.67 among all three versions of "optimized.v"

Table 1. Different FSCORE to each "optimized.v"

| Version                    | Version 1   | Version 2 | Version 3   |
|----------------------------|-------------|-----------|-------------|
| <count_asgn>+</count_asgn> | 6           | 4         | 3           |
| <count_sles></count_sles>  |             |           |             |
| FSCORE                     | 20/6 = 3.33 | 20/4 = 5  | 20/3 = 6.67 |

Both public and hidden testcases will be used to evaluate a contestant's program. The ranking of the contest is based on the summation of the normalized FSCORE compared to all contestants from each testcase. For instance, if the highest FSCORE in testcase 1 is 0.1 and your FSCORE is 0.01, then the highest FSCORE will be set to 100, and your normalized FSCORE will be 10.

# 4. Methodology Compared to Current Research

In recent years, researchers have usually used traditional graphs such as AIG to do logic optimization [4]. However, we use a data structure similar to the binary tree to represent the circuit rather than AIG. The complexity of traversing the circuit will be decreased by this data structure, which has good properties to deal with don't care and pattern comparison implementation

What's more, we find that there are few kinds of research talking about how to optimize the RTL file which has the XOR gates. In our program, we develop this structure to optimize the Verilog file and the circuit accompanied by the XOR gates. Fig. 1 is the flowchart of our work, which will be elaborated on in the next part.



Figure. 2 Flowchart

# 5. Implementation

The implementation procedure is followed the flowchart above. First of all, our parser will take the given Verilog file as input, then build a data structure similar to the binary tree. Basic logic rules and the substitution method are applied to obtain a reduced Verilog file and start to implement local optimization.

We first consider to minimize on the tree structure, some Boolean identities are used to find the subtree patterns of the functional equivalent equations, then the structure of the subtree can be rewritten to optimize the circuit to implement pattern matching. After replacing the tree structure, SAT-solver-based redundant wire checking focuses on removing the wires.

Gates merging is to compare two circuits and merge gates that have the same functionality. Output merging is an operation to combine the continuous output with the same type in both input and output. Last but not least, we substitute repeated prime implicant with temporary wires, Xformtmp, then output the optimized Verilog file.

### 5.1. Data Structure

We define a node with the following information: operation, left type, left value, right type, right value, and flag as Figure. 2. A node is used to record one assignment of the input Verilog file. Before using the parser to extract the information, we will read the information of the input file such as the input and output size, and the number of the origtmp size. After we have read the input information, we will construct two arrays (output array and origtmp array). The data structure of a cell of the array will be the node we have constructed previously.

The right\_type and left\_type of the node structure will be three conditions (origtmp, in, constant). The right\_value and left\_value will denote the index if the type is origtmp or in; those will be the constant value (1'b1, 1'b0, 1'bz, 1'bx) if the type is const (constant).



Figure. 2 Data Structure

Once we finish the parser of the input file and get all information of the input assignments, we will now consider constructing the type of data structure. Since we

would have the perfect structure that would only contain two inputs in one assignment. Therefore, for each output node (out[#]), we can view it as the root of the binary tree.

Before we construct the tree, we have to define the tree node of the binary tree. With the data structure of the binary tree, we will next construct a binary for each output node with the help of the previous constructed arrays (output and original). We use the "factoringTree" function to implement.

Be based on the left type/value and right type/value to point the current node's pointer to the right array cell. Therefore, we will get the final binary tree. If the size of the output is M and the size of originary is N, the complexity of the factoring tree will be O(M+N) since we will read the arrays to factor the tree. Also, since we will go through the entire tree to traverse the node, the complexity of traversal will be O(M+N).

### 5.2. Basic Logic Synthesis And Basic Logic Rules

In Boolean algebra, we all know several Boolean identities are used to minimize the Boolean function. First of all, we will list some common constant operations of the Boolean function in the following table.



Figure. 3 Basic logic operations

Therefore, we will go through all the cells of the array to see whether there is possible bits operation in the first step minimization. Also, note that high z signals and don't care signals will have the same behavior in the table. Therefore, we will view those two signals in the same variable (constant 2). When a cell of the array consists of a constant and a variable, the Boolean identities also tell us that we can use those identities to minimize the function as follow:

1. 
$$A|0 = A$$
  
2.  $A|1 = 1$ 

```
3. A|2 = either 1 or 2 ( = 1 when A is 1). For simplicity, A|2 = 1

4. A\&0 = 0

5. A\&1 = A

6. A\&2 = 0 ( same reason as 3.)

7. A^0 = A

8. A^1 = A

9. A^2 = 2 ( by the previous chart )
```

### 5.3. Substitution

After minimizing the inner data structure, we hope to combine or substitute some variables with the corresponding equations. For example, if we have two equations:

It is obvious that we can substitute original with  $in[0] \mid in[1]$  into out[1] for the first case. When we represent the equations as our data structure, we can see the relationship between the two equations. Also, we will hope that we will start our minimization from the leaves; hence, we will first use the postorder traversal to start the minimization from roots.

In the substitution method, the worst case will be going through all nodes of the binary tree. Therefore, if the size of the output is M and the size of originary is N, then the overall complexity of second step minimization will be O(M+N).

## 5.4. Pattern Comparison

In the third step minimization, we offer several Boolean equations that can be minimized as the equations below. If we use tree traversal to find the subtree patterns of the equations, then we can rewrite the structure of the subtree to optimize the circuit.

```
1. a|(a\&b) = a

2. a\&(a|b) = a

3. a|(\sim a\&b) = a|b

4. a\&(\sim a|b) = a\&b

5. \sim a\&b \mid a\&\sim b = a^b

6. \sim(\sim a) = a (Self-Negation)

7. a\&(\sim a) = 0 (Double Negation)
```

For example, the previous equations will have the structure like previously the following graph:



Figure. 5 The pattern comparison example of a|(a&b) = a



Figure. 6 The pattern comparison example of  $a|(\sim a\&b) = a|b|$ 





Figure. 7 The pattern comparison of  $\sim a\&b \mid a\&\sim b = a^b$ 

#### 5.5. Commutative Law

In the Boolean equation, several gates have the commutative identity. Therefore, it is likely to get an optimized function with the help of commutative law. For example,

```
Before Commutative Law:

assign out[0] = origtmp1 & origtmp2;
assign origmp1 = in[0] & in[2];
assign origtmp2 = in[1] & in[2];

After Commutative Law:
assign out[0] = origtmp1 & origtmp2;
assign origmp1 = in[0] & in[1];
assign origtmp2 = in[2] & in[2];

Applying the previous minimization:
assign out[0] = origtmp1 & in[2];
assign origtmp1 = in[0] & in[1]
```

With the help of pattern comparison, we can get the optimized circuit in the simple tree traversal. However, if we use the method to revise the circuit, we will change the functionality of the circuit (We have changed the functionality of origtmp1 and origtmp2). Therefore, we will need to construct an additional wire to help us implement the commutative law. The equation should be revised as:

Also, we will hope that we will start our minimization from the leaves; hence, we will first use the postorder traversal to start the minimization from roots. Note that we can only use origitmp, xformtmp wire in the output file. Therefore, the additional wires should only be named as xformtmp. The dynamic method to allocate the xformtmp can be referred to the following graph:



Figure. 8 Commutative Law

Also, we will hope that we will start our minimization from the leaves; hence, we will first use the postorder traversal to start the minimization from roots. In the third step minimization, the worst case will be going through all nodes of the binary tree. Therefore, if the size of the output is M and the size of originary is N, then the overall complexity of second step minimization will be O(M+N).

## 5.6. Further Pattern Comparison

In further pattern comparison, we offer several Boolean equations that can be minimized as the equations below. If we use tree traversal to find the subtree patterns of the equations, then we can rewrite the structure of the subtree to optimize the circuit.

```
1. a^(a\&b) = a\&\sim b;
```

2. 
$$a^{(a|b)} = -a\&b$$
;

3. 
$$a&(a^b) = a&-b$$
;

4. 
$$a|(a^b) = a|b;$$

5. 
$$a^{\sim}(a\&b) = \sim a|b;$$

6. 
$$a^{\sim}(a|b) = a|\sim b$$
;

7. 
$$a&(a^b) = -a&b$$
;

8. 
$$a|(a^b) = -a|-b$$
;

We will use pattern comparison the same as the previous pattern comparison; however, we will use the additional wires to avoid the revision of the functionality. We will hope that we will start our minimization from the leaves; hence, we will first use the postorder traversal to start the minimization from roots.

## 5.7. SAT Solver Based Redundant Wire Checking

In the EDA field, we will use testing (stuck-at-fault) to check the possible faults in the chips. However, stuck-at-fault can also help us to find out the redundant wire of the circuit. For a combination circuit, an undetectable fault is corresponding to a redundant wire. Undetectable faults do not change the functionality of the circuit. The reason can be referred to here.



Figure. 9 Remove redundant wires by stuck-at-fault test

The previous graph shows when the s-a-1 fault is assigned to an AND gate. No matter which value m and n are, wire 1 can always be viewed as the redundant wire. On the other hand, we will get three redundant wires when wire 1 is assigned to constant 0.

How can we say the wire is redundant? As previously mentioned, we will get the same functionality as the original circuit and the stuck-at-fault circuit. Therefore, we will need to generate the whole possible inputs to compare the functionality of the original circuit and the stuck-at-fault circuit.

Note that we will use the XOR gate to check two circuits to check whether the circuits are equivalent or not to the following equations.

```
f = ab+ac, fa = ac

Ta = the set of all tests for fault a

= ON_set(f^fa)

= ON_set(f) * OFF_set(fa) + OFF_set(f) * ON_set(fa)

= {(a, b, c) | (ab+ac)(ac)' + (ab+ac)'(ac) = 1}

= {(a, b, c) | abc' = 1}

= {(110)}
```

Figure. 10 Stuck-at-fault test example

In the final equation of the example, we tend to find the possible combination of a, b, and c for equation abc' = 1. Note that this is a SAT problem; therefore, we will use an SAT solver to check the circuit. We can find lots of open-source modern SAT solvers. With the help of modern SAT solvers, we can efficiently check if the wire is redundant or not. In this project, we will use Glucose SAT.

In order to implement the SAT-based redundant wire checking, we will need to first duplicate the original circuit for a stuck-at-fault circuit.

After we duplicate the tree, we will need an additional XOR gate to connect two circuits into a miter circuit. Then, we can next assign either s-a-1 or s-a-0 fault into an assigned

wire. When finishing assigning the wire, we will use Tseitin transformation to transform the miter circuit into a form that can be used by SAT solver.

Tseitin transformation takes as input an arbitrary combinational logic circuit and produces a Boolean formula in conjunctive normal form (CNF), which can be solved by a CNF-SAT solver. The length of the formula is linear in the size of the circuit. Input vectors that make the circuit output "true" are in 1-to-1 correspondence with assignments that satisfy the formula. This reduces the problem of circuit satisfiability on any circuit (including any formula) to the satisfiability problem on 3-CNF formulas.

The one-to-one relation of the logic gate and Tseitin transformation is shown in the following graph.

| Туре    | Operation                   | CNF Sub-expression                                                                                                                                               |
|---------|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| AND AND | $C = A \cdot B$             | $(\overline{A} ee \overline{B} ee C) \wedge (A ee \overline{C}) \wedge (B ee \overline{C})$                                                                      |
| NAND    | $C = \overline{A \cdot B}$  | $(\overline{A} ee \overline{B} ee \overline{C}) \wedge (A ee C) \wedge (B ee C)$                                                                                 |
| OR      | C = A + B                   | $(A ee B ee \overline{C}) \wedge (\overline{A} ee C) \wedge (\overline{B} ee C)$                                                                                 |
| NOR     | $C = \overline{A + B}$      | $(A \lor B \lor C) \land (\overline{A} \lor \overline{C}) \land (\overline{B} \lor \overline{C})$                                                                |
| NOT     | $C=\overline{A}$            | $(\overline{A} ee \overline{C}) \wedge (A ee C)$                                                                                                                 |
| XOR     | $C = A \oplus B$            | $(\overline{A} \vee \overline{B} \vee \overline{C}) \wedge (A \vee B \vee \overline{C}) \wedge (A \vee \overline{B} \vee C) \wedge (\overline{A} \vee B \vee C)$ |
| XNOR    | $C = \overline{A \oplus B}$ | $(\overline{A} ee \overline{B} ee C) \wedge (A ee B ee C) \wedge (A ee \overline{B} ee \overline{C}) \wedge (\overline{A} ee B ee \overline{C})$                 |

Figure. 11 Tseitin transformation

We now can use the Tseitin transformation to build the clause for the input of the SAT solver. We have mentioned the one-to-one relation between the original gates and the tseitin form. Therefore, we can apply the SAT solver to find the redundant wire.

## 5.8. The usage of xformtmp

First usage: The in[1] would be counted as one cost for bit selection, but xformtemp wouldn't. Thus, we use the xformtmp to substitute the input node which appears multiple times in the primary input of other gates, like xformtmp1 = in[1]; The second usage mentioned in the "Output merging" operation and the "xformtmp" is the new temporary wire of sequential node.

```
Before the usage of xformtmp:

assign out[1] = in[1] ^ in[2];

assign out[2] = in[2] ^ in[3];

assign out[3] = in[2] ^ in[4];

After the usage of xformtmp:

assign out[1] = in[1] ^ xformtmp1;

assign out[2] = xformtmp1 ^ in[3];

assign out[3] = xformtmp1 ^ in[4];

assign xformtmp1 = in[2];
```

- The score of the original one, counting assign = 3, bit selection = 9
- The score of optimized one, counting assign = 4, bit selection = 5

From the above, we see the optimized one has the better score. The second usage mentioned in the "Output merging" operation and the "xformtmp" is the new temporary wire of sequential node.

### 5.9. Gate Merging

Gates merging is to compare two circuits and merge gates that have the same functionality for each node one by one. If the program finds two nodes that have the same child node, put the parent nodes' pointer of one of them onto another one. The process is bottom-up. Time complexity is  $O(n^2)$ . Because the time complexity isn't well, we need to implement an algorithm like simulated annealing and a limit for this process.



Figure. 12 Gate merging

Assume the operators of tmp1 and tmp3 are the same and the tmp2, tmp4 also.

## 5.10. Output Merging

Output merging is an operation to combine the continuous output with the same type in both input and output. For example, the equations out[1] = 1'b1, out[0] = 1'b0 should be combined together into out[1:0] = 2'b10 for better score.

According to the evaluation of problem spec, we should put the primary output in a sequence form in Verilog for a better score. The equations, out[1] = 1'b1, out[0] = 1'b0, should be combined together into out[1:0] = 2'b10.

What's more, the problem is more complicated, out[1] = a  $^b$  out[2] = c  $^d$ , a=in[1]  $^i$  n[5] b = in[3]  $^i$  in[7] c=in[2]  $^i$  in[6] d=in[4]  $^i$  in[8], should be combine as out[2:1] = xformtmp1[1:0]  $^i$  xformtmp2[1:0], xformtmp1[1:0] = in[2:1]  $^i$  in[6:5], xformtmp2[4:3] = in[4:3]  $^i$  in[8:7].

The first step is to check the primary output node if it is a sequence. The second step is putting "a" and "c" wires together and "b", "d" wires are also. The reason why we can put a and c wire together is that the child nodes are an input sequence and they have the same operators. When once find the above relation, we put them together and create a new temporary wire, and this new wire I call it "sequential node." Then, if the child node of "a" wire isn't an input node, we should take its child node as a sequential node at beginning until the program processes the children and get the real answer back. Thus, this whole process is an iteration from the primary output to input and get the result from the bottom node.



Figure. 13 Output merging with sequential X nodes

Also, we found the scoring of the output sequence has a connection with their primary input nodes. If one node in the process isn't sequential, the total structure cannot form a perfect sequence and it would increase the cost of counting bit selections but sometimes still can decrease the total score. Below have a few examples:

```
Before Output Merging:
out[1] = a ^ b;
out[2] = c ^ d;
a = in[2] ^ in[4];
b = in[11] ^ in[8];
c = in[12] ^ in[14];

After Output Merging:
out[2:1] = xformtmp1[1:0] ^ xformtmp2[1:0];
xformtmp1[0] = in[2] ^ in[4];
xformtmp1[1] = in[1] ^ in[8];
xformtmp2[1:0] = in[12:11] ^ in[13:14];
```

- The score of original one, counting assign = 6, bit selection = 10
- The score of optimized one, counting assign=4, bit selection= 12

From the above, we can know that if the sequence isn't the perfect sequence. The optimized one has a higher bit selection cost but a lower assign cost. Then have on to the next example.

```
Before Output Merging:
                                       After Output Merging:
out[1] = a \wedge b;
                                       out[3:1] = xformtmp1[2:0] ^ xformtmp2[2:0];
out[2] = c \wedge d;
                                       xformtmp1[0] = in[2] \land in[5];
out[3] = e \wedge f;
                                      x formtmp1[1] = in[3] ^ in[7];
                                       xformtmp1[2] = in[4] ^ in[9];
a = in[2] ^ in[5];
b = in[3] ^ in[7];
                                       xformtmp2[2:0] = in[13:11] \land in[16:14];
c = in[11] ^ in[14];
d = in[12] \land in[15];
e = in[4] ^ in[9];
f = in[13] \land in[16]
```

- The score of original one, counting assign = 9, bit selection = 15
- The score of optimized one, counting assign=5, bit selection= 15

This example shows that the optimized one has a better score even though it has the higher bit selection cost. This one also shows that it has a longer sequence than the first example and both of their output nodes have one side that isn't the sequential node but the longer sequence one has the better performance.

```
Before Output Merging:
                                             After Output Merging:
out[1] = origtmp1 \land origtmp2;
                                             out[2:1] = xformtmp1[1:0] \land xformtmp2[1:0];
out[2] = origtmp3 ^ origtmp4;
                                            xformtmp1[0] = e ^ f;
origtmp1 = e \wedge f;
                                            xformtmp1[1] = in[3] ^ in[7];
                                            xformtmp2[1:0] = in[12:11] \land in[15:14];
origtmp3 = in[3] \wedge in[7];
                                             e = in[4] ^ in[9];
origtmp2 = in[11] ^ in[14];
                                            f = in[13] \land in[16];
origtmp4 = in[12] \land in[15];
e = in[4] ^ in[9];
f = in[13] \land in[16];
```

- The score of original one, counting assign = 8, bit selection = 12
- The score of optimized one, counting assign=6, bit selection= 14

This one is actually like the first example. The original one and the optimized one have the same "e" and "f" wires. Using the in[2], in[4] to replace the e, f wires, the circuit would be the first example.

```
After the first transition:
Before Output Merging:
out[1] = a \wedge b;
                                       out[2:1] = xformtmp1[1:0] ^ xformtmp2[1:0];
out[2] = c \wedge d;
                                       xformtmp1[1:0] = xformtmp3[1:0] ^ xformtmp4[1:0];
a = e \wedge f;
                                      x formtmp2[1:0] = in[12:11] \land in[15:14];
c = g \wedge h;
                                       xformtmp3[1] = in[4] ^{n} in[9]; -> not sequential
b = in[11] ^ in[14];
                                       x formtmp3[0] = in[1] ^ in[5];
d = in[12] \wedge in[15];
                                       xformtmp4[1] = in[3] ^{\land} in[7]; -> not sequential
e = in[4] ^ in[9];
                                       xformtmp4[0] = in[0] ^ in[6];
f = in[0] \wedge in[6];
g = in[1] ^ in[5];
h = in[3] ^ in[7];
```

```
After the second transition:
out[2:1] = xformtmp1[1:0] \land xformtmp2[1:0];
xformtmp1[1] = origtmp7 ^ origtmp8;
xformtmp1[0] = origtmp5 ^ origtmp6;
xformtmp2[1:0] = in[12:11] \land in[15:14];
origtmp5 = in[4] ^ in[9];
origtmp6 = in[0] \wedge in[6];
origtmp7 = in[1] ^ in[5];
origtmp8 = in[3] \wedge in[7];
```



The circuit through the first transition, which takes the wire as sequential first, would have a worse performance but the circuit through the second transition would have the meaning of the fourth example. Thus, the circuit through the second transition and the original one are the same cost.

From the above, I define the node how to return its property when the program processes the node. If the operator of the optimized node is "not" or no operator and the node have a sequential node, it should return "It is a sequential node" to its parent node. If the operator of the optimized node is "|", "&" or "^" and the node has two sequential nodes, it should return "It is a sequential node". The else conditions would make the node return "It isn't a sequential node". With the iteration in the program, the output node would know whether it can form a sequence or not. In addition, I would like to evaluate the cost after the progression and compare the optimized one and the original one and make the optimized one has a better score.

## 6. Result

The following experimental results in the public 9 testcases were conducted in the Linux environment provided by the CAD Contest as Table 2, which represents the score comparison between the original Verilog file and the optimized one.

One of these approaches can get a score over ten times better than the original case. For the final test, we have seventeen testcases over 96 normalized FSCORE over twenty testcases, and nine testcases get 100 normalized FSCORE. Even some test cases can be simplified to the simplest form of them by our work. As the result, After the "Final Test", we won the ICCAD Contest First and Fourth prizes.

Table 2. The FSCORE of each testcases

| version | original  | optimized | version     | original    | optimized |
|---------|-----------|-----------|-------------|-------------|-----------|
| case 1  | 0.365854  | 0.566038  | case 6      | 0.000499875 | 0.0376689 |
| case 2  | 0.410256  | 0.610687  | case 7      | 0.397772    | 0.642467  |
| case 3  | 0.404218  | 0.633609  | case beta 1 | 0.523256    | 1.95652   |
| case 4  | 0.0714286 | 0.571429  | case beta 2 | 0.106956    | 0.123919  |
| case 5  | 0.444444  | 4         |             |             |           |

## 7. Conclusion

Based on the result mentioned above, our project not only implements traditional optimization methods but provides novel strategies to further improve the simulation efficiency. One of these approaches can get a score almost ten times better than before, a test case can be reduced to one over seventy-five compared to the original testcase. Even some test cases can be simplified to the simplest form of them by our work.

Our data structure contributes to decreasing the complexity of traversing the circuit, which has the great property that it can directly represent the circuit without any transformation. For example, to transform the circuit as AIG, the synthesis tool will add additional inverters into the circuit since there will only be two types of the gate in the circuit (And and Inverter). Not only provides properties to deal with don't care and rewrite subtree structure, but further optimization can also further optimization can be easily implemented.

Besides, our methodology optimizes on different aspects. The substitution method replaces some variables with the corresponding equations. For the substructure minimization, Boolean identities are implemented by the pattern comparison. From the perspective of removing unnecessary wires, SAT-solver-based redundant wire checking is used. As node simplification, gate merging decreases the number of functional equivalent gates by combining them. Furthermore, Xformtmp is added to increase the flexibility of the optimization to create potential strategies.

In the prospects, this work will calculate mandatory assignments to implement the redundant addition and removal technique or the minimization methods like fast node merging [5], which completes the node simplification since we already use the stuck-at-fault test. These public testcases will be further compared to the mainstream optimization methods, then learn the different skills from them. Based on the knowledge we learned, a better algorithm will be constructed for faster simulation efficiency. Since the data structure can be viewed as a directed acyclic graph (DAG), a DAG-aware synthesis method [6] probably is a good direction to develop in. The ultimate goal is to build a tool similar to "ABC: A System for Sequential Synthesis and Verification" [7], which provides both functions like logic synthesis and formal verification.

# 8. Reference

- [1] A. Evans et al., "Functional verification of large ASICs," Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175), San Francisco, CA, USA, 1998, pp. 650-655, DOI: 10.1145/277044.277210.
- [2] Y. Deng, "GPU Accelerated VLSI Design Verification," 2010 10th IEEE International Conference on Computer and Information Technology, Bradford, 2010, pp. 1213-1218, DOI: 10.1109/CIT.2010.219.
- [3] K. Ko, T. Liu, "Problem F: Verilog Simulation Optimization via Instruction Reduction", Synopsys, Inc.
- [4] L.s Machado, J. Cortadella, "Boolean Decomposition for AIG Optimization", GLSVLSI '17: Proceedings of the on Great Lakes Symposium on VLSI 2017May 2017 pages 143-148, DOI: 10.1145/3060403.3060420.
- [5] Y.C. Chen and C.Y. Wang, "Fast Node Merging With Don't Cares Using Logic Implications", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Volume: 29, Issue: 11, Nov. 2010), pp. 1827-1832, DOI: 10.1109/TCAD.2010.2058510.
- [6] A. Mishchenko, S. Chatterjee et al., "DAG-Aware AIG Rewriting: A Fresh Look at Combinational Logic Synthesis", 2006 43rd ACM/IEEE Design Automation Conference, San Francisco, CA, USA, 2006, DOI: 10.1145/1146909.1147048.
- [7] A. Mishchenko, S. Chatterjee et al., "FRAIGs: A Unifying Representation for Logic Synthesis and Verification", Department of EECS University of California, Berkeley.