| Design of Digital Circuits: Lab Report |                                  |                        |  |  |
|----------------------------------------|----------------------------------|------------------------|--|--|
| LAB 9 – The Performance of MIPS        |                                  |                        |  |  |
| Date                                   | 29. May 28, 2019                 | Grade                  |  |  |
| Names                                  | Sven Pfiffner & David Zollikofer |                        |  |  |
|                                        |                                  | Lab session / lab room |  |  |
|                                        |                                  | Fri 10-12 / HG E26.1   |  |  |

You have to submit this report via Moodle.

Use a zip file or tarball that contains the report and any other required material. Only one member from each group should submit the report. All members of the group will get the same grade.

The name of the submitted file should be LabN\_LastName1\_LastName2.zip (or .tar), where LastName1 and LastName2 are the last names of the members of the group.

Note 1: Please include all the required material. No links/shortcuts are accepted.

Note 2: The deadline for the report is a hard deadline and it will not be extended.

## Exercise 1

For the following values of A and B, how many clock cycles are needed to execute your first program from Lab 7 on your baseline MIPS processor, before adding optimizations of Lab 9? Assuming that we run the MIPS processor at 20 MHz, how much time (in seconds) would that take?

| Value of A  | Value of B  | Number of cycles           | Time in seconds                   |
|-------------|-------------|----------------------------|-----------------------------------|
| 0           | 8           | 38                         | $38*1/20$ MHz = $1.9*10^{(-6)}$ s |
| 6           | 8           | 14                         | $14*1/20$ MHz = $7*10^{-7}$ s     |
| 0           | 250'000'000 | 1'000'000'007 <sup>1</sup> | 1'000'000'007*1/20MHz = 50s       |
| 249'999'996 | 250'000'002 | $32^{2}$                   | $32*1/20$ MHz = $1.6*10^{(-6)}$ s |

```
We will use the following code:
```

```
main:
    # initialize the two registers
    addi $t0, $zero, 0
    addi $t1, $zero, 1000
    add $t2, $zero, $zero

loop: add $t2, $t2, $t0
    beq $t0, $t1, end
    addi $t0, $t0, 1
    j loop

end:
    j end # Infinite loop at the end of the program.
```

We note that initialization takes 3 steps if the numbers fit into the immediate 16 bits. If not, it will take two steps per load. (Please refer to exercise 2 where we will discuss this in detail). After initialization we will increment the counter a total of (B-A) times which will result in the execution of (B-A) times the loop body (4 instructions). Furthermore, we will have another two instructions in the final loop pass where we execute the add at the loop as well as the beq, after which we will have the result of our calculation. Hence the formula for normal sized numbers (fitting into the immediate) is 4 + 4(B-A) + 2.

We are now running x instructions at 20 MHz, as a result it will take us x\*1/20MHz seconds to execute the program.

<sup>&</sup>lt;sup>1</sup>+1 cycle since B doesn't fit into immediate

<sup>&</sup>lt;sup>2</sup> +2 cycles since A & B don't fit into immediate → see Ex 2 for more details on why we need the extra cylce

## Exercise 2

Fill in the new values for the Table in Exercise 1 when using the modified MIPS architecture running the optimized code, as discussed in the manual for Lab 9.

| Value of A  | Value of B  | Number of cycles | Time in seconds                   |
|-------------|-------------|------------------|-----------------------------------|
| 0           | 8           | 11               | $11*1/20$ MHz = $5.5*10^{(-7)}$ s |
| 6           | 8           | 11               | $11*1/20$ MHz = $5.5*10^{(-7)}$ s |
| 0           | 250'000'000 | 11+1 = 12        | $12*1/20$ MHz = $6*10^{(-7)}$ s   |
| 249'999'996 | 250'000'002 | 11 + 1 + 1 = 13  | $13*1/20$ MHz = $6.5*10^{(-7)}$ s |

We note that the code now looks as follows: (we have opted for the code you have given us since this makes correcting easier for you):

```
.text
main:
       addi $t0, $0, 0 # $t0 = A addi $t1, $0, 200 # $t1 = B
       addi $t2, $t0, -1 # A-1
       multu $t0, $t2  # A (A-1)
       mflo $t0  # mult result in t0
        srl $t0, $t0, 1  # divide by two
        addi $t2, $t1, 1 # B+1
       multu $t1, $t2  # B (B+1)
       mflo $t1
                   # mult result in t1
        srl $t1, $t1, 1  # divide by two
        sub $t2, $t1, $t0 # end result is the difference
 end:
                          # loop t2 is the result
        j end
```

We

note that there is a total of 11 instructions. The first one being the first add instruction initializing the array value A (in this case with 0) and the last one being the difference between the sum of 0 to A and 0 to B. Since changing either A or B just changes the initialization this programs length does not depend on the input (as long as A and B can be given using add immediate, if they are longer then we will need to initialize in a more sophisticated manner using load upper immediate etc.)

This is the case in the third and fourth example. The numbers 250'000'000, 249'999'996 and 250'000'002 do not fit in the 16 bits that we have to store an immediate value. Hence, we have to initialize them as follows:

```
lui $t0, upper16Bits
ori $t0, $t0, lower16Bits
```

This adds 1 cycle per oversized initialization.

We are now running x instructions at 20 MHz, as a result it will take us x\*1/20MHz seconds to execute the program.

## Exercise 3

Compare the size/device utilization of the two implementations (before and after the modifications in Lab manual 9). What differences do you see? Briefly comment on them. *Hint: Look into the synthesis report.* 

We would like to point out that a direct comparison is not possible since Lab 9 had a different structure, it had no traditional top module that could be "mounted" with a constraints file onto the FPGA but was rather designed to be tested on the testbench. As a result, we have added the modification of lab 9 to the files of lab 8 (→ the new ALU as well as support for the new operations) which we will compare to the "normal" lab 8. This will make a comparison more useful since metaphorically speaking comparing apples to oranges would not be useful.

The results are the following (after synthesis):



no multiplication

with multiplication

As expected the implementation does not the number of inputs. We see a big increase in the number of LUT's which has been cause by the addition of the multiplication logic.

The increase in the number of LUT's can be further investigated in the Synthesis reports where we see the following picture for our ALU.v file<sup>3</sup>:

| Component   | No Multiplication | With Multiplication |
|-------------|-------------------|---------------------|
| Adders      | 1                 | 1                   |
| XORs        | 1                 | 1                   |
| Multipliers | 0                 | 1                   |
| Muxes       | 5                 | 9                   |

We can see that the inclusion of the multiplication (obviously) needs a multiplier as well as more muxes (since we now have more complex control logic).

One question that remains is how do the registers that store the low value get implemented in the code. We first thought that would be done using flipflops, which would increase the number of flipflops used in the synthesis report (see picture on previous page).

However, since we don't have an increase usage something else must be happening. After looking at the schematics it becomes clear. We are storing the value of low in a latch which can be implemented using LUT's as we learned in lecture 6 (see the slides on latches lecture  $6 \rightarrow$  they can be implemented using gates). This also contributes to the usage of LUT's but does not increase the number of flipflops needed.



5

<sup>&</sup>lt;sup>3</sup> This is only the ALU.v file and not the whole processor.

## **Feedback**

If you have any comments about the exercise please add them here: mistakes in the text, difficulty level of the exercise, or anything that will help us improve it for the next time.

We didn't really know if we had to handle the case where the numbers wouldn't fit into the immediate 16 bits. As a result, we have put a lot of time into getting this right, perhaps it could be a bit clearer if this was necessary or not.