

# Politecnico di Torino

Integrated Systems Architecture

## Report laboratory 1 Design and implementation of a digital filter

Students: Group 40

| Menditto Gianluca       | 274913 |
|-------------------------|--------|
| Orabona Michelangelo    | 275289 |
| Santoro Andrea Giovanni | 274916 |
| Virgilio Raffaele       | 274914 |

Link repository github: https://github.com/Group40ISA/ISA

## Contents

| 1            | $\operatorname{Des}$ | ign an         | d implementation of a digital filter        | 2  |
|--------------|----------------------|----------------|---------------------------------------------|----|
|              | 1.1                  | Refere         | ence model development                      | 2  |
|              | 1.2                  | VLSI           | implementation                              | 3  |
|              |                      | 1.2.1          | Simulation                                  | 4  |
|              |                      | 1.2.2          | Logic Synthesis                             | 5  |
|              |                      | 1.2.3          | Place & Route                               | 7  |
|              | 1.3                  | Advar          | nced architecture development               | 9  |
|              | 1.4                  |                | implementation                              |    |
|              |                      | 1.4.1          | Simulation                                  |    |
|              |                      | 1.4.2          | Logic Synthesis                             |    |
|              |                      | 1.4.3          | Place & Route                               |    |
| $\mathbf{A}$ | VH                   | ${ m DL}$ list | ting                                        | 16 |
|              | A.1                  | Listing        | g of the 1st order IIR not optimized        | 16 |
|              |                      | A.1.1          | Listing of the 1st order IIR with LookAhead | 18 |
| В            | Bac                  | kward          | annotation scripts                          | 27 |
| $\mathbf{C}$ | C li                 | stings         |                                             | 30 |
|              | C.1                  | C mod          | del for L.A. implementation                 | 30 |

## 1 Design and implementation of a digital filter

#### 1.1 Reference model development

The goal of the report is to describe the necessary steps to design a digital filter with a **cut-off frequency** of 2 kHz and a **sampling frequency** of 10 kHz. According to the rules described in the general description of the laboratory: the group number is even, so p = 0 and the filter will be an IIR with order and number of bits given by the two following expression, where x = 8 and y = 7:

$$N = 2^{p}[(x \bmod 2) + 1] + 6p = 1$$
$$n_{b} = (y \bmod 7) + 8 = 8$$

The given Matlab scripts give the filter coefficients and a reference output to test the results of the filter described in C; the coefficients are:

$$b_1 = b_0 = 53$$
  $a_0 = 128$   $a_1 = -21$ 

The coefficient  $a_0$  will be neglected since in fractional point representation will be equal to one. The data flow graph of the IIR filter in **direct form II** is presented in figure 1, as described in the following equation:

$$y[n] = b_0 w[n] + b_1 w[n-1]$$
  
 $w[n] = x[n] - a_1 w[n-1]$ 



Figure 1: Data Flow Diagram of the designed IIR filter in direct form II. The  $Z^{-1}$  is the unitary delay element

After checking the outputs of the filter described by C program, the **Total Harmonic Distortion** (THD) is evaluated (using the Matlab function thd()) and it is equal to:

$$THD = -40.118 dB$$

#### 1.2 VLSI implementation

First of all, the design flow starts with the implementation of the architecture at infinite resources to obtain the upper bound of the accuracy of the filter. Given the THD specification that must be maximum -30 dB, the goal is to minimize the area and this can be done by reducing the internal data representation.

To do that, the partial results of the structure of the C model are analyzed. These latters lead to a change of the internal parallelism of the structure, which decreases from the 16 bit required by the infinite resources implementation to 9 bit.

This is because some partial results exceed the maximum number representable by 8 bits, so it was necessary to add a guard bit to guarantee correct computation of the filter. To validate as said before, the simulation will be performed both with maximum and minimum parallelism, and also the synthesis, to have a comparison between the two implementation and to appreciate the saved area.

In this section some interesting key points are explained in details, while the whole VHDL implementation is reported in Appendix A.1.

In according to the following figure:



Figure 2: Filter interface

the samples (DIN) are generated one each clock cycle with a validation signal (VIN). When VIN ='1', a new sample is loaded into the architecture, in fact this input is used into load enable port of the input register. The output DOUT contains the result of the filtering and it is correct when VOUT='1', this is available after one clock cycle of latency from the corresponding entering sample. To uncouple the structure with the external, two barrier registers are used: one for DIN and another for DOUT. The signal VIN and VOUT have a temporary register to synchronize with the right timing of the data since they have an input and output register. Moreover, VIN signal, after the input register is used to disable feedback register when the input data are not valid so to not compromise the internal result.

Since a given parallelism in output and input is required, but internally it is needed to work with one more bit, some approximations are necessary and to understand how to do

these, a digression on fractional representation is useful: generally a number in fractional can be represented as

 $\frac{a}{2^y}$ 

where a is the integer value and y is the number of bit after the decimal point. When two numbers in fractional are multiplied, it is obtained:

$$\frac{a}{2^y} \cdot \frac{b}{2^y} = \frac{ab}{2^{2y}}$$

This result means, as said before, the internal parallelism is higher and so double precision is required; in particular this leads to a change of weights of the bits, so the MSB bit needn't be taken into account in order to restore the weights. The approach to these approximations in VHDL follows different rules from the implementation in C:

- Multiplications: after the multiplication we have a number represented on 17 bits, the MSB, as expressed before, and the less significant seven bits are truncated; this operation allows to have the results of the multiplications on nine bits.
- Output: the introduced guard bit is removed.

#### 1.2.1 Simulation

At this point, the system described so far is simulated with *Modelsim* in order to verify its behavior. The figure 3 shows the circuit works correctly even when VIN from one goes to zero: the input data is not sampling by the input register; nevertheless all the operators inside are still working, so to avoid to produce a wrong result, also VOUT goes to low level. In fact, as seen in the figure below, DOUT contains a wrong result (35) among two correct ones but it is not present in the output file, like clear in the table below.



Figure 3: Correct in coming data sampling

| n° | 1 | 2  | 3  | 4  | 5  |
|----|---|----|----|----|----|
| С  | 0 | 16 | 18 | 45 | 50 |

Table 1: Extract of first some results

In closing the results produced are compared with the ones obtained in C language, and they are the same, as required, as showed in figure 4:



Figure 4: Some first results

#### 1.2.2 Logic Synthesis

After verifying the correct behavior of the circuit, the next step is the synthesis. Before that, some constraints have to be applied: first of all, the clock period is set at  $10 \ ns$ ; then to be closer to reality, uncertainty of clock, both input and output delay and the load of each output are applied. Hence the synthesis is performed in order to find the maximum clock period, which result is showed in report extract inserted below:

| data required time | 14.42 |
|--------------------|-------|
| data arrival time  | -3.60 |
|                    |       |
| slack (MET)        | 10.81 |

From this data, the maximum clock period is obtained:

$$T_{clk_{max}} = data \ arrival \ time + libray \ setup \ time = 3.60 + 0.03 = 3.63 \ ns$$

and observing the specifics, the period is set as below

$$T_{clk} = T_{clk_{max}} * 4 = 14.52 \ ns.$$

Backward annotation: From now on, all the analysis were done by considering this clock period, by which all the backward annotation process is done: after having analyzed elaborated and compiled the current design, the synthesizer generetes the Standard Delay Format file (SDF) which contains detailed information about the delay of the circuit, and the Verilog netlist generated with a specific technological library. Once Modelsim has read the .sdf file and after the simulation, it generates the Value Change Dump (VCD), in which are annotated all the toggles of nodes. Since Synopsys is able to read .saif file containing the same information of the .vcd file, we need to convert it. In the last step the circuit power is analyzed by considering the .saif file and the netlist generated before. All these steps are done automatically by using the scripts reported in appendix B.

**Analysis of the results:** To be able to do a comparison, the above process is performed also for the structure developed with maximum resources.

#### Area with reduced parallelism Area at infinite resources

Total cell area: 1028.622009 Total cell area: 1407.140012

Power with reduced parallelism [uW] Power at infinite resources [uW]

Cell Internal Power = 87.4379 Cell Internal Power = 122.9963 Net Switching Power = 57.6360 Net Switching Power = 96.0296

Total Dynamic Power = 145.0740 Total Dynamic Power = 219.0259

Cell Leakage Power = 20.5863 Cell Leakage Power = 28.2748

As expected, it is obtained an area reduction about 27% than to other solution: it happens because operators and registers are more simple and so this leads to lower power dissipation, that decreases about 24%. In details, the main reduction is related to *Net switching power*, which decreases because a lower number of toggling nodes thanks to the reduced parallelism of the architecture.

#### 1.2.3 Place & Route

Now, the place and route (PnR) is performed. At the beginning, Innovus set the area of the cell, and then several steps are realized to complete the process. The first one is the insertion of power rings: one for  $V_{dd}$  and one for  $V_{ss}$  and so the horizontal wires to connect them to all cells. So the placement and routing is realized, and in figure 5 is presented the final result of the PnR.



Figure 5: Final result of the PnR

At this point, the same analysis done during synthesis are repeated again, in particular the backward annotation is performed by employing *Innovus*' verilog netlist. After that, the area and power reports are obtained with the optimization made by Innovus. The extract of them are reported below:

#### Report area

Combinational area: 884.982005 Buf/Inv area: 46.018000 Noncombinational area: 143.640005 Macro/Black Box area: 0.000000

Total Cell Area: 1028.622009

#### Report power

Cell Internal Power = 94.5434 uW Net Switching Power = 62.3197 uW

Total Dynamic Power = 156.8631 uW

Cell Leakage Power = 20.5695 uW

Definitively, as seen, the area obtained is the same one produced during synthesis, and instead about the estimation of power, in this case is possible to notice a tolerable increase respect to the previous one due to the higher precision in the generation of .sdf file.

#### 1.3 Advanced architecture development

Now the goal is to improve the previous structure, increasing throughput. To achieve this purpose, the method used is the "Look ahead". Before applying this approach, it is useful to refer to some theoretical issues. Considering the figure 1:

$$T_{cp} = T_m + T_a$$

where  $T_a$  denotes the delay of adder and  $T_m$  the delay of multiplier; evaluating the *iteration* bound:

$$T_{inf} = T_m + T_a$$

For this reason, no improvement is possible because  $T_{inf} = T_{ck}$ . At this point the *look ahead method* become significant: in fact, it allows to substitute recursive equation in the main one and so, the final structure has

$$T_{inf} < T_{clk}$$

and some enhancement are possible. Let's apply the method:

$$y[n] = b_0 w[n] + b_1 w[n-1]$$

$$w[n] = x[n] - a_1 w[n-1]$$

Recursively:

$$w[n-1] = x[n-1] - a_1w[n-1]$$

So substituting in the y[n] equation:

$$y[n] = b_0 x[n] - a_1 b_0 x[n-1] + b_0 a_1^2 w[n-2] + b_1 x[n-1] - a_1 b_1 w[n-2]$$
  
$$y[n] = b_0 (x[n] - a_1 x[n-1] + a_1^2 w[n-2]) + b_1 x[n-1] - a_1 b_1 w[n-2]$$
(1)

The architecture which implements the previous equation is:



Figure 6: The structure after look ahead

In this case the iteration bound, as said before, is lower:

$$T_{inf} = (T_a + T_m)/2$$

and it means that now universal techniques can be applied.

At this point, it is need to choose the number of pipeline stages to achieve the maximum throughput. Initially, the pipe stages are inserted keeping mind to save area, so the level II, III and IV are fixed. Also, analyzing the structure, notice that there are some registers sequentially in the same branch to balance properly the delay line of all other branches.



Figure 7: The structure with highlighted pipeline stage

But as showed in the figure 7,  $T_{inf}$  is still the same, and so it's possible to continue the improvement using another universal technique that is the *retiming*. Let's see how move the registers by *cut-set retiming*; a *cutset* is a group of edges in DFG which when removed from graph, they are able to divide the latter in two parts no more connected. In this case, observing the loop, it is possible to identify a cutset as showed in figure 8: in according to this approach, if a register is removed by an edge that links DFG 1 (otherwise the portion of DFG circled in red), to DFG 2 (the remaining part), the registers have to be inserted in the edges which link the DFG parts in opposite way; so it implies to insert two registers after the multiplier circled, one on the right and one on the left. The figure 9 illustrates the whole structure after retiming.

At this point, another consideration is needed; in fact now the loop is no more a critical path that instead is represented by the link between multiplier and adder on the left of the structure and by the vertical edge that links the adder to multiplier on the right of DFG (see the red edges in figure 9): for this reason it is efficient to add other two pipe stages as showed in figure 10 which allows to achieve:

$$T_{cp} = max(T_a, T_m)$$



Figure 8: Cutset retiming



Figure 9: DFG after retiming



Figure 10: Final DFG

#### 1.4 VLSI implementation

The listings of the VHDL code implementing the filter and all the required component are reported in Appendix A.1:

- Reg & Reg\_en: they are the register used to implement the pipe level and the delay element in the structure. The registers with the enable are required to allow the unit to work well when VIN goes low.
- Shift register: used for the synchronization of the VOUT with the output and to drive the enable of the registers.

About the new structure, there are some considerations to do. At first, the interface, as required remains the same, but there are two new coefficients  $(a_1^2, -a_1b_1)$  as seen in the equation 1. Moreover, since several pipeline registers are inserted, and they lead to synchronization problems:

- one related to VOUT that denotes a valid data at output, so it needs to be delayed by a shift register that implements the pipe levels (exactly five levels).
- one related to VIN in particular when it toggles from high level to low for a given number of clock edges, a bubble propagates in the shift register and it leads to stop properly the registers, to avoid them to be corrupted by incorrect data. To optimize the area, the stopped registers are only the feedback registers;

Regarding the internal parallelism of the structure, reproducing the same procedure done for the characterization of the previous filter, an internal parallelism of 9 bit is obtained, so truncations at the output of the multiplier and at the output of the filter are required (done with the same procedure described before).

#### 1.4.1 Simulation

At this point, the system described so far is simulated with Modelsim in order to verify its behavior. The figure 11 shows the correct behavior of the filter, and also some results. As clear the result of the structure are slightly different from the previous one, that's due to the necessary approximation for the new coefficients. To do a proper comparison with a C model,

the given one is modified to implement the same equation obtained by the Look Ahead method. (The code is reported in the appendix C.1). In general the structure lead to a THD far above the specification:

$$THD = -42.82dB$$

In the figure 12 it is possible to see how the system works properly even if VIN goes low for one clock edges, as expressed before.



Figure 11: Simulation of the first few cycles of the filter, the output in this case is denoted by 'y'



Figure 12: Simulation of the filter when VIN goes low for some clock edges

#### 1.4.2 Logic Synthesis

Once again, the next step is the synthesis of the filter. Before all, it's set the clock period, used for the synthesis at 10 ns, and with this analysis it's obtained the minimum clock period achievable by this structure, given by the sum of the data arrival time and the library set up time.

| data required time | 9.89  |
|--------------------|-------|
| data arrival time  | -2.24 |
|                    |       |
| slack (MET)        | 7.65  |

As expected the maximum achievable frequency increases by 37% with respect to the previous structure, since the critical path is reduced.

Applying the same constraints on the clock period  $(4 * T_{min})$  the backward annotation process is performed, leading to the following results concerning area and power consumption:

#### Report area Report power

Combinational area: 1535.884000 Cell Internal Power = 242.7182 uW Buf/Inv area: 67.032001 Net Switching Power = 99.0343 uW

Noncombinational area: 936.320030

Macro/Black Box area: 0.000000 Total Dynamic Power = 341.7525 uW

## Total Cell Area: 2472.204036 Cell Leakage Power = 47.5412

As expected the performance is at the expense of area and power consumption: the area has more than doubled, compared to the previous implementation, in fact it increases by 147% due to the more components used, which are pipe registers, multiplier, adder.

About the power there is an increasing of the *total dynamic power* by 135% and *cell leakage power* by 130%, these are due to the several toggling nodes in the structure for dynamic power, and more operator used relative to the leakage.

#### 1.4.3 Place & Route

Finally the PnR is done, leading to the layout presented in the figure below:



Figure 13: Layout of the IIR filter optimized by the Look Ahead method

The backward annotation process is then repeated by using the .sdf file generated by *Innovus* and the report about the area and the power consumption are showed below:

# Report area Report power Combinational area: 1535.884000 Cell Internal Power = 260.1618 uW Buf/Inv area: 67.032001 Net Switching Power = 110.4398 uW Noncombinational area: 936.320030 Macro/Black Box area: 0.000000 Total Dynamic Power = 370.6016 uW

Total Cell Area: 2472.204036 Cell Leakage Power = 47.4530

As we can see, and for the same reasons described in the previous PnR section, we have an increase in power by 9% with respect to the report power done by Synopsys.

## A VHDL listing

#### A.1 Listing of the 1st order IIR not optimized

```
1 library ieee;
use ieee.std_logic_1164.all;
3 use ieee.numeric_std.all;
5 entity filter is
   port(
      clk, rst : in std_logic;
vin : in std_logic;
8
      Din, b0, a1 : in signed(7 downto 0);
9
     vout : out std_logic;
Dout : out signed(7 downto 0)
11
    );
12
13 end entity filter;
14
15 architecture RTL of filter is
16
    signal m1, m2, ff
                                     : signed(8 downto 0);
17
   signal m1_tmp, m2_tmp, ff_tmp : signed(16 downto 0);
   signal tmpfb, fb
                                   : signed(8 downto 0);
19
   signal w, s1
                                     : signed(8 downto 0);
20
    signal v_out, v_in, tmprvin : std_logic;
21
    signal tmpDin
                                     : signed(7 downto 0);
    signal tmpDout
                                     : signed(7 downto 0);
23
   signal internalDin
                                    : signed(8 downto 0);
24
   signal extension
                                    : signed(0 downto 0);
    signal tmprv_out
                                    : std_logic;
27 begin
28
    registerrvout : process(clk)
29
30
   begin
     if rst = '0' then
31
       tmprv_out <= '0';</pre>
32
     elsif (clk'event and clk = '1') then
       tmprv_out <= v_out;</pre>
35
     end if;
    end process registerrvout;
36
    vout <= tmprv_out;</pre>
37
    registerrvi : process(clk)
39
40
    begin
    if rst = '0' then
41
       tmprvin <= '0';</pre>
42
     elsif (clk'event and clk = '1') then
43
        tmprvin <= vin;</pre>
44
     end if;
    end process registerrvi;
46
    v_in <= tmprvin;</pre>
47
```

```
registerfb : process(clk)
    begin
50
     if rst = '0' then
51
       tmpfb <= (others => '0');
52
      elsif (clk'event and clk = '1') then
        if v_in = '1' then
54
          tmpfb <= w;</pre>
55
        end if;
56
     end if;
57
    end process registerfb;
58
    fb <= tmpfb;</pre>
59
60
61
    registersDin : process(clk)
                                        --@suppress
    begin
62
     if rst = '0' then
63
       tmpDin <= (others => '0');
     elsif (clk'event and clk = '1') then
       if vin = '1' then
66
          tmpDin <= Din;</pre>
67
        end if;
68
      end if;
    end process registersDin;
70
    -- sign extension to conform the parallelism of the
71
    -- input to the parallelism of the internal repres.
72
    extension <= (others => tmpDin(7));
73
    internalDin <= extension & tmpDin;</pre>
74
75
    registersDout : process(clk)
                                         --@suppress
76
77
   begin
     if rst = '0' then
78
       tmpDout <= (others => '0');
79
      elsif (clk'event and clk = '1') then
80
       tmpDout <= s1(7 downto 0);</pre>
81
      end if:
82
    end process registersDout;
83
    Dout <= tmpDout;</pre>
84
86
    -- This blocks the output whenever
87
    -- the input are not correct, since the block
    -- after samples only when Vout = 1.
89
90
    v_out <= v_in;</pre>
91
    ______
    -- the implemented equations are:
93
    -- y = b0*w(n) + b1*w(n-1)
94
    -- w = x(n) - a1*w(n-1)
95
    m1_tmp <= fb * a1;
                                    --result moltiplication need 17 bit
97
   m1 <= m1_tmp(15 downto 7); --restoring on 9 bit the result. It's 15
    downto 7
   w <= internalDin - m1; -- and not 16 downto 8 because the dot
```

#### A.1.1 Listing of the 1st order IIR with LookAhead

Listings of the registers without enable

```
1 library ieee;
use ieee.std_logic_1164.all;
3 use ieee.numeric_std.all;
4 -- This is a register whitout load enable signal --
5 entity reg is
6
    generic(parallelism : integer := 9);
7
    port(
      clk : in std_logic;
8
     rst : in std_logic;
9
     D : in signed( parallelism - 1 downto 0);
          : out signed( parallelism - 1 downto 0)
11
    );
12
  end entity reg;
13
14
15 architecture RTL of reg is
16
    signal tmp : signed( parallelism - 1 downto 0);
17
18
19 begin
    regprocess : process(clk, rst)
                                          --@suppress
20
21
   begin
22
      if rst = '0' then
       tmp <= (others => '0');
23
     elsif clk'event and clk = '1' then
24
          tmp <= D;
     end if;
27
   end process;
    Q <= tmp;
29 end architecture RTL;
```

#### Listings of the registers with enable

```
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
-- This is a register that has a load enable signal to save the data --
entity reg_en is
generic(parallelism : integer := 9);
```

```
port(
      clk : in
                std_logic;
      rst : in std_logic;
9
          : in signed(parallelism - 1 downto 0);
10
          : out signed(parallelism - 1 downto 0);
11
          : in std_logic
12
      en
    );
13
end entity reg_en;
16 architecture RTL of reg_en is
    signal tmp : signed(parallelism - 1 downto 0);
18
19
20 begin
    regprocess : process(clk, rst)
                                          --@suppress
21
22
    begin
     if rst = '0' then
        tmp <= (others => '0');
24
     elsif clk'event and clk = '1' then
25
        if en = '1' then
26
          tmp <= D;
        end if;
28
      end if;
29
    end process;
    Q <= tmp;
31
32 end architecture RTL;
```

#### Listings of the shift register used to synchronize $V_{out}$

```
1 library ieee;
use ieee.std_logic_1164.all;
3 use ieee.numeric_std.all;
 entity shift is
    generic(levelPipe : integer := 5);
6
    port (
     clk : in std_logic;
     rst : in std_logic;
     D : in std_logic;
10
      Q : out std_logic;
11
      enable : out std_logic_vector(1 downto 0)
12
    );
13
14 end entity shift;
16 architecture RTL of shift is
17
    signal tmp : std_logic_vector(levelPipe - 1 downto 0);
18
19
20 begin
21
    process(clk, rst)
    begin
22
    if rst = '0' then
23
  tmp <= (others => '0');
```

#### Listings of the filter

```
1 library ieee;
use ieee.std_logic_1164.all;
3 use ieee.numeric_std.all;
5 entity filter is
    generic(parallelism : integer := 8);
    port(
7
      clk
                         : in std_logic;
      rst
                         : in std_logic;
9
      BO, A1, C, A_sqrd : in std_logic_vector(parallelism - 1 downto 0);
                         : in std_logic;
11
                         : in std_logic_vector(parallelism - 1 downto 0);
12
      х
                         : out std_logic_vector(parallelism - 1 downto 0);
13
      У
14
      vout
                         : out std_logic
15
16 end entity filter;
17
  architecture RTL of filter is
19
20
    component reg is
21
      generic(parallelism : integer := 9);
      port(
23
        clk : in std_logic;
24
25
        rst : in std_logic;
        D : in signed(parallelism - 1 downto 0);
            : out signed(parallelism - 1 downto 0)
27
      );
28
29
    end component reg;
    component reg_en is
31
      generic(parallelism : integer := 9);
32
      port(
33
        clk : in std_logic;
34
        rst : in std_logic;
35
           : in signed(parallelism - 1 downto 0);
36
            : out signed(parallelism - 1 downto 0);
37
38
           : in std_logic
39
    end component reg_en;
40
```

```
component shift is
      generic(levelPipe : integer := 5);
43
      port(
44
        clk
                : in std_logic;
45
                : in std_logic;
        rst
                : in std_logic;
47
               : out std_logic;
48
        enable : out std_logic_vector(1 downto 0)
49
      );
51
    end component shift;
52
53
    signal extension
                                                     : signed(0 downto 0);
    signal concatenated
                                                     : signed(8 downto 0);
55
                                                     : signed(parallelism - 1
    signal x_internal
56
    downto 0);
                                                     : signed(8 downto 0);
    signal x_minus_1
    signal m1_tmp, m2_tmp, m3_tmp, m4_tmp, m5_tmp : signed(16 downto 0);
58
    signal m1, m2, m3, m4, m5
                                                     : signed(8 downto 0);
59
    signal s1, s2, s3, s4
                                                     : signed(8 downto 0);
60
    signal out_reg_2, out_reg_3
                                                     : signed(8 downto 0);
                                                     : signed(8 downto 0);
    signal out_reg_4
62
    signal out_pipe_1, out_pipe_2, out_pipe_3
                                                     : signed(8 downto 0);
63
    signal out_pipe_4, out_pipe_5
                                                     : signed(8 downto 0);
    signal out_pipe_6, out_pipe_7, out_pipe_13
                                                    : signed(8 downto 0);
65
                                                     : signed(8 downto 0);
66
    signal out_pipe_8, out_pipe_9, out_pipe_10
    signal out_pipe_11, out_pipe_12
                                                     : signed(8 downto 0);
67
    signal s_v, v_in, tmp_vin
                                                     : std_logic;
69
    signal y_signed
                                                     : signed(7 downto 0);
    signal x_signed
                                                     : signed(7 downto 0);
70
    signal enable
                                                     : std_logic_vector(1 downto
71
     0);
72
73 begin
74
    in_regs : reg_en
75
      generic map(
76
        parallelism => 8
77
78
      port map(
        clk => clk,
80
        rst => rst,
81
           => x_signed,
        D
82
           => (x_internal),
        Q
        en => Vin
84
      );
85
    x_signed <= signed(x);</pre>
86
87
    extension(0) <= x_internal(7);</pre>
88
    concatenated <= extension & x_internal;</pre>
89
90
91 x_n_minus_1 : reg
```

```
generic map(
         parallelism => 9
93
94
       port map(
95
         clk => clk,
         rst => rst,
97
              => (concatenated),
98
              => (x_minus_1)
99
       );
100
101
     out_reg : reg_en
       generic map(
103
         parallelism => 8
104
       port map(
106
         clk => clk,
         rst => rst,
         D => (s3(7 downto 0)),
109
            => y_signed,
         Q
110
         en => enable(0)
111
     y <= std_logic_vector(y_signed);</pre>
113
114
     v_out_pipe : shift
115
       generic map(
116
         levelPipe => 6
117
118
       port map(
119
120
         clk
                       => clk,
         rst
                       => rst,
121
         D
                       => s_v,
         Q
                       => vout,
         enable => enable
124
       );
126
     reg_2 : reg_en
127
       generic map(parallelism => 9)
128
       port map(
129
         clk => clk,
130
         rst => rst,
             => (m3),
132
              => (out_reg_2),
         Q
         en => enable(1)
134
       );
136
     reg_3 : reg_en
       generic map(parallelism => 9)
137
       port map(
138
         clk => clk,
         rst => rst,
140
         D
              => (s2),
141
              => (out_reg_3),
         Q
142
         en => enable(1)
143
```

```
);
144
     reg_4 : reg
145
       generic map(parallelism => 9)
146
       port map(
147
         clk => clk,
         rst => rst,
149
            => (m4),
150
              => (out_reg_4)
         Q
       );
152
153
     -----Vin Register-----
154
     reg_vin : process(clk)
155
156
     begin
       if rst = '0' then
157
         tmp_vin <= '0';
158
       elsif (clk'event and clk = '1') then
         tmp_vin <= Vin;</pre>
       end if;
161
     end process reg_vin;
162
     v_in <= tmp_vin;</pre>
163
164
165
     pipe_1 : reg
166
       generic map(
         parallelism => 9
168
169
       port map(
170
         clk => clk,
171
         rst => rst,
172
             => (concatenated),
173
         Q
              => (out_pipe_1)
174
       );
175
176
     pipe_2 : reg
177
       generic map(
178
         parallelism => 9
179
180
       port map(
181
         clk => clk,
182
         rst => rst,
         D
            => (m2),
184
         Q
              => (out_pipe_2)
185
       );
186
     pipe_3 : reg
188
       generic map(
189
         parallelism => 9
190
       )
191
       port map(
192
         clk => clk,
193
         rst => rst,
194
         D => (m5),
195
```

```
=> (out_pipe_3)
          Q
       );
197
198
     pipe_4 : reg
199
       generic map(
         parallelism => 9
201
202
       port map(
203
         clk => clk,
          rst => rst,
205
              => (s1),
206
              => (out_pipe_4)
          Q
207
208
       );
209
     pipe_11 : reg
210
       generic map(
211
          parallelism => 9
213
       port map(
214
         clk => clk,
          rst => rst,
              => (out_pipe_8),
217
          Q
              => (out_pipe_11)
218
       );
219
220
     pipe_12 : reg
221
       generic map(
222
          parallelism => 9
223
224
       port map(
225
         clk => clk,
226
         rst => rst,
         D
              => (s4),
228
              => (out_pipe_12)
229
       );
230
231
     pipe_5 : reg
232
       generic map(parallelism => 9)
233
       port map(
234
         clk => clk,
          rst => rst,
236
          D
              => (out_pipe_3),
237
              => (out_pipe_5)
          Q
238
       );
239
240
     pipe_6 : reg
       generic map(parallelism => 9)
241
       port map(
242
         clk => clk,
          rst => rst,
244
              => (s2),
         D
245
              => (out_pipe_6)
          Q
246
       );
```

```
pipe_13 : reg
       generic map(parallelism => 9)
249
       port map(
250
         clk => clk,
251
         rst => rst,
              => (out_reg_3),
253
              => (out_pipe_13)
254
       );
255
     pipe_7 : reg
       generic map(parallelism => 9)
257
       port map(
         clk => clk,
259
260
         rst => rst,
              => (out_pipe_5),
261
         Q
              => (out_pipe_7)
262
       );
263
     pipe_8 : reg
       generic map(parallelism => 9)
265
       port map(
266
         clk => clk,
267
         rst => rst,
268
              => (m1)
269
              => (out_pipe_8)
270
       );
271
     pipe_9 : reg
272
       generic map(parallelism => 9)
273
       port map(
274
         clk => clk,
         rst => rst,
276
              => (out_reg_4),
277
         Q
              => (out_pipe_9)
278
       );
     pipe_10 : reg
280
       generic map(parallelism => 9)
281
       port map(
282
         clk => clk,
         rst => rst,
284
              => (out_pipe_7),
285
286
         Q
              => (out_pipe_10)
       );
287
288
     m1_tmp <= out_pipe_6 * signed(B0);</pre>
290
     m2_tmp <= x_minus_1 * signed(A1);</pre>
291
     m3_tmp <= out_reg_3 * signed(A_sqrd);</pre>
292
     m4_tmp <= out_pipe_13 * signed(C);</pre>
     m5_tmp <= x_minus_1 * signed(B0);</pre>
294
295
     m1 <= m1_tmp(15 downto 7); -- truncation of the mult</pre>
296
     m2 <= m2_tmp(15 downto 7); -- to adapt with the internal</pre>
```

```
m3 <= m3_tmp(15 downto 7); -- parallelism</pre>
299
     m4 <= m4_tmp(15 downto 7);
     m5 <= m5_tmp(15 downto 7);</pre>
300
301
     s1 <= out_pipe_1 - out_pipe_2;</pre>
     s2 <= out_pipe_4 + out_reg_2;</pre>
303
     s3 <= out_pipe_11 + out_pipe_12;
304
     s4 <= out_pipe_10 - out_pipe_9;</pre>
305
307
     s_v <= v_in;
309
310 end architecture RTL;
```

### B Backward annotation scripts

This script (create\_sdf.scr) is required to compile all the components on synopsys and to create the .sdf file and the verilog netlist, require for modelsim to annotate the activity of all the internal nodes.

```
#CREATE_SDF.SCR
variable targetcompilation "../src/filter.vhd"
3 variable top_hierarchy "filter"
4 variable clock "10.0"
6 analyze -format vhdl $targetcompilation
7 set power_preserve_rtl_hier_names true
9 elaborate $top_hierarchy -arch RTL -lib work > elaborate_transcript.txt
10 create_clock -name my_clk -period $clock clk
set_dont_touch_network my_clk
13 set_clock_uncertainty 0.07 [get_clocks my_clk]
14 set_input_delay 0.5 -max -clock my_clk [remove_from_collection [all_inputs]
15 set_output_delay 0.5 -max -clock my_clk [all_outputs]
17 set OLOAD [load_of NangateOpenCellLibrary/BUF_X4/A]
18 set_load $OLOAD [all_outputs]
19
20 compile
21
22 sh mkdir clock_$clock
23 report_timing > clock_$clock/report_timing.txt
24 report_area > clock_$clock/report_area_vhd.txt
25 ungroup -all -flatten
26 change_names -hierarchy -rules verilog
27 write_sdf ../netlist/$top_hierarchy.sdf
 write -f verilog -hierarchy -output ../netlist/$top_hierarchy.v
vrite_sdc ../netlist/$top_hierarchy.sdc
```

This script (fill\_forward.scr) is used by Modelsim to create the .vcd file, that needs to be converted in .saif to allow synopsys to use it and to report the precise power.

```
#FILL_FORWARD.SCR
#setting env

rm -rf ../sim/work
vlib work

# compile

vcom -93 -work ./work ../tb/clk_gen.vhd
vcom -93 -work ./work ../tb/data_maker_new.vhd
vcom -93 -work ./work ../tb/data_sink.vhd

vlog -work ./work ../netlist/filter.v
```

```
volog -work ./work ../tb/tb_fir.v

vsim -L /software/dk/nangate45/verilog/msim6.2g -sdftyp /tb_fir/UUT=../
    netlist/filter.sdf work.tb_fir

recate sdf

vcd file ../vcd/filter.vcd
vcd add /tb_fir/UUT/*

run simulation
run 3 ms
```

After converting the .vcd in .saif, this last script, uses this file in order to give accurate information on the power dissipation by the structure.

```
# BACKWARD.SCR
variable clock "10.0"

sh vcd2saif -input ../vcd/filter.vcd -output ../saif/filter.saif

create_clock -name my_clk -period $clock clk

read_verilog -netlist ../netlist/filter.v
read_saif -input ../saif/filter.saif -instance tb_fir/UUT -unit ns -scale 1

report_power > clock_$clock/report_power.txt
report_area > clock_$clock/report_area_Verilog.txt
```

These script are also used for the back annotation performed after the generation of the netlist by *Innovus*, except for create\_sdf.scr replace by the following script in which the command related to the creation of the .sdf file and the netlist are removed (since they are created by innovus itself)

```
19
20 compile
21
22 sh mkdir clock_$clock
23 report_timing > clock_$clock/report_timing.txt
24 report_area > clock_$clock/report_area_vhd.txt
```

## C C listings

#### C.1 C model for L.A. implementation

```
#include < stdio.h>
#include < stdlib.h>
4 #define NB 8 /// number of bits
5 #define N 1 // order of the filter
7 const int b0 = 53; /// coefficient b0
8 const int b[N]={53}; /// b array
9 const int a[N]={-21}; /// a array
10 const int ab=a[0]*b[0] >>(NB-1);
11 const int a_sqrd=a[0]*a[0] >>(NB-1);
_{
m 13} /// Perform fixed point filtering assuming direct form II
14 ///param x is the new input sample
15 ///return the new output sample
int myfilter(int x)
17 {
    static int cont=0;
    static int sw[2]; /// w shift register
19
    static int sx[1];/// shift register della x
20
    static int first_run = 0; /// for cleaning the shift register
21
    int i; /// index
    int w; /// intermediate value (w)
23
    int y; /// output sample
24
    int fb, ff; /// feed-back and feed-forward results
25
    FILE *fd; // this file allow to know intermedial results.
    int m1,m2,m3,m4,m5; // temponary variable.
27
28
    // opening the file
29
30
    fd=fopen("val_inter.txt","a");
31
    /// clean the buffer
32
    if (first_run == 0)
33
      first_run = 1;
35
      for (i=0; i<N+1; i++)</pre>
36
        sw[i] = 0;
      for (i=0; i<N; i++)</pre>
38
        sx[i] = 0;
39
    }
40
    /// compute feed-back and feed-forward
42
    fb = 0;
43
    ff = 0;
44
    for (i=0; i<N; i++)</pre>
45
46
      m2 = ((sx[i]*a[i])
                              >> (NB-1));
47
      m3= ((sw[i+1]*a_sqrd) >> (NB-1));
```

```
m4 = ((sw[i+1]*ab)
                                                                                          >> (NB-1));
                      m5 = ((sx[i]*b[i])
                                                                                                         >> (NB-1));
50
                     fb = -( -m2 + m3 );
51
                      ff += ( m5 - m4);
52
53
54
               /// compute intermediate value (w) and output sample
55
               w = x + fb;
56
               y = (w*b0) >> (NB-1);
57
58
               /* check delle op. intermedie */
59
               \texttt{fprintf(fd,"For the \%d samples: s1:\%d s2:\%d s4:\%d m1:\%d m2:\%d m3:\%d m4:\%d 
                  m5:%d\n", cont, w-m3, w, ff, y, m2, m3, m4, m5);
               fclose(fd);
61
               cont+=1;
62
63
               // compute the filtered result
               y += ff;
65
66
67
               /// update the shift register
               for (i=N-1; i>0; i--)
69
                      sx[i] = sx[i-1];
70
               sx[0] = x;
71
               for (i=N; i>0; i--)
72
                      sw[i] = sw[i-1];
73
               sw[0]=w;
74
75
76
               return y;
77 }
78
79 int main (int argc, char **argv)
       {
80
               FILE *fp_in;
81
               FILE *fp_out;
82
               int x;
84
               int y;
85
86
               /// check the command line
87
               if (argc != 3)
88
89
                      printf("Use: %s <input_file> <output_file>\n", argv[0]);
90
                       exit(1);
91
92
93
               /// open files
94
               fp_in = fopen(argv[1], "r");
               if (fp_in == NULL)
96
97
                      printf("Error: cannot open %s\n",argv[1]);
98
                      exit(2);
```

```
100
     fp_out = fopen(argv[2], "w");
101
102
     /// get samples and apply filter
103
     fscanf(fp_in, "%d", &x);
105
     y = myfilter(x);
106
     fprintf(fp_out,"%d\n", y);
107
      fscanf(fp_in, "%d", &x);
108
     } while (!feof(fp_in));
109
     fclose(fp_in);
110
     fclose(fp_out);
111
112
113
     return 0;
114
115 }
```