

# SC3050 Advanced Computer Architecture

# LAB-2

The register file (RF) in the CPU is used to hold operands and addresses for one or more clock cycles. In this lab you will analyze the implementation of the RF. Besides, you also need to find the area and access time of RF of different sizes (number of registers) and for various bit-widths of registers. You are provided with parameterizable Verilog code where the bit-width and the number of registers can be changed. You can specify the bit-width and the number of registers in the Verilog code of the RF and find the change in LUT consumption and access speed from the synthesis results. In the second part of lab 2, you need to build a single cycle datapath for Register-type instruction. You are also given with the Verilog code for the control logic. You need to notice how the code of register file with the code for control logic and the ALU (of Lab 1) are integrated by module instantiation to form the datapath. You are required to understand the given Verilog code completely.

# PARAMETERIZABLE IMPLEMENTATION OF RF

You are provided with the Verilog file written in behavioral approach for implementing the parametrizable RF where the bit-widths and number of registers can be changed to find the hardware and access time corresponding to different bit-widths and address-widths. The block diagram of the register file is given in Figure 1.



Figure 1: Block diagram of the register file.

Note that address width depends on the number of registers in the RF. If the address width is "n" then number of registers in the register file is "2^n". The following are the characteristics of the behavioral RF. The RF is positive edge triggered, which means that the data available at the write port (wdata) is written into the selected register in the register file on the positive edge of clock if the write enable (wen) signal is high. The register in which the wdata could be written is selected by a decoder according to the write address (waddr). The read ports are combinational circuits. Data stored the registers specified by raddr1 and raddr2 are obtained as radata1 and rdata2, respectively. All registers are reset to 'zero' when the reset signal is high.

- 1. The Verilog file "regfile.v" (provided) is a parameterizable behavioral code.
- 2. The "define.v" file is used to define DSIZE (bit-width of the register) = 4, NREG (number of registers) = 4, ASIZE (address size) = 2. The testbench "regfiletest.v" (provided) helps to test the behavioral implementation of RF. The test-bench uses file operation (an input file is used to read data therefrom and an output file to write data).
- 3. Verify the operation by simulation.



- 4. Synthesize the given behavioral code and plot the "number of slices" used vs "No. of registers" and 'the minimum clock cycle period' in ns vs 'No. of registers' (change the parameter for No. of registers {NREG=4, 8, 16, 32, 64}). Here we can keep the bit-width of each register (DSIZE parameter) to be constant (for example DSIZE=64). Note that when we change number of registers (NREG) the address parameter ASIZE also needs to be changed since NREG = 2^(ASIZE).
- 5. Please note that we are taking the minimum clock period for the delay and not the maximum combinational delay. Minimum clock period can be calculated by the following equation:

```
Minimun Clock Period = Actual Period - Worst Negative Slack(WNS)
```

In Vivado, to get the minimum clock period in ns, you may find the worst negative slack (WNS) first, and then calculate the value of minimum clock period following the equation. To get the value of WNS, you may manually type commands in the TCL Console to access the timing summary of the synthesis: After **EACH** synthesis, run the command below in TCL Console first:



Then, run the next command to generate the timing summary:

```
report_timing_summary
```

In the timing summary report, navigate to the "Design Timing Summary" to find the value of WNS. Then, the minimum clock period can be calculated by subtracting WNS from the Actual Period (10 ns).

#### EVALUATION - 1

| No. of Registers (NREG) | Bit-width of the register (DSIZE) | No of register slices | No of LUT slices | Minimum clock<br>Period (in ns) |
|-------------------------|-----------------------------------|-----------------------|------------------|---------------------------------|
| 4                       | 64                                |                       |                  |                                 |
| 8                       | 64                                |                       |                  |                                 |
| 16                      | 64                                |                       |                  |                                 |
| 32                      | 64                                |                       |                  |                                 |
| 64                      | 64                                |                       |                  |                                 |

Table 1: LUT consumption and delay for behavioral register file vs bit-width

(Please make sure to implement all the experiments and fill out Table 1 in the last page of this manual, you'll need to submit the outcomes at the end of this lab.)



6. Synthesize the given behavioral code and plot the number of slices used vs bit-width of the register (DSIZE) and delay in ns vs bit-width of the register (change the parameter for bit-width of the registers {DSIZE=4, 8, 16, 32, 64}). Here we can keep the number of registers (NREG parameter) to be constant (for example NREG=32) and thus ASIZE will be a constant =5.

| Bit-width of the register (DSIZE) | No. of registers (NREG) | No of slices | register | No of LUT slices | Minimum clock<br>Period (in ns) |
|-----------------------------------|-------------------------|--------------|----------|------------------|---------------------------------|
| 4                                 | 32                      |              |          |                  |                                 |
| 8                                 | 32                      |              |          |                  |                                 |
| 16                                | 32                      |              |          |                  |                                 |
| 32                                | 32                      |              |          |                  |                                 |
| 64                                | 32                      |              |          |                  |                                 |

Table 2: LUT consumption and delay for behavioral register file vs number of registers

(Please make sure to implement all the experiments and fill out Table 2 in the last page of this manual, you'll need to submit the outcomes at the end of this lab.)

## CHALLENGE - 1 (Optional)

- 1) Plot the graph, area in LUT slices (vs) No. of registers (NREG) as well as delay (vs) No. of registers (NREG) for the register file module for NREG = 4, 8, 16, 32, 64. Set DSIZE (bit-width of each register) = 64.
- 2) Plot the graph, area in LUT slices (vs) bit-width of register (DSIZE) as well as delay (vs) bit-width of register (DSIZE) for the register file module for DSIZE = 4, 8, 16, 32, 64. Set NREG (number of registers) = 32.







### SIMPLE DATAPATH AND CONTROL

In this section you will use the ALU (64 bits) from the first lab and the register file (parameter NREG (number of registers) =32, DSIZE (bit-width of each register) =64, and ASIZE (address size to access 32 registers) = 5) which you have analyzed earlier in this lab to make a simple Datapath and control unit of a processor. The Verilog code for control unit which generates the control signals for the operation of RF and ALU is also provided ("control.v"). Instead of directly feeding data inputs to the ALU or RF, you need to do it by an instruction and according to the signal generated by the control unit to perform a sequence of operations. Please note that you have 64-bit data, and a 32-bit instruction words.

The ADD, SUB, AND, XOR, and ORR instructions are of R-format instructions with addressing mode register addressing (R- format).

#### **R-format ALU instructions:**

Meaning: [Rd]  $\leftarrow$  [Rn] (OP) [Rm]: the registers Rm and Rn are the source registers, and the result is stored in the destination register Rd. Instruction structure: <operation>, <Rd>, <Rm>.

The bit assignments to different fields of R-format are shown below.

| opcode | Rm    | shamt | Rn   | Rd  |  |
|--------|-------|-------|------|-----|--|
| 31     | 21 20 | 16 15 | 10 9 | 5 4 |  |

Example 1: ADD X5, X4, X3 (meaning:  $X5 \leftarrow (X4 + X3)$ ). The machine format is given below.

| 0000000000 |    | 00011 | 000000 |    | 00100 |   | 00101 |
|------------|----|-------|--------|----|-------|---|-------|
| 31         | 21 | 20 16 | 5 15   | 10 | 9     | 5 | 4     |

Example 2: XOR, X8, X9, X10 (meaning: X8 ← X9 (XOR) X10):

| 0000000011 |    | 01010 |    | 000000 |    | 01001 |   | 01000 |
|------------|----|-------|----|--------|----|-------|---|-------|
| 31         | 21 | 20    | 16 | 15     | 10 | 9     | 5 | 4     |

Please note that as per define.v, the opcodes for ADD, SUB, AND, XOR and ORR are "00000000000", "00000000001", "00000000010", "00000000011" and "00000000100" respectively. A block diagram of the expected design is shown in Figure 2:



Figure 2: Datapath and Control



The control unit provides the control words for RF (write enable "wen" signal), and ALU (operation "op[2:0]" signal, for selecting the operation in ALU). RFs are initialized to zero. We need to load data into the registers of the RF before we do any operations. Since you have not yet implemented memory operations, we need to load few values into the register file by hardcoding as done in regfile.v shown below.

```
verilog \cong 
regdata[1] <=2; //hardcoding few values into register file for initialization 
regdata[2] <=3;</pre>
```

In order to put together the datapath and control, we need to connect the RF, ALU and the control unit. The main "datapath\_single.v" is provided to you. In the given Verilog file you need to instantiate "regfile.v", "control.v", are "alu.v" and connect together. Modules are instantiated inside other modules, and each instantiation creates a unique object from the template. The instantiated module's ports must be matched to those defined in the template. This is specified:

- (i) By name, using a dot (.) ".template port name (name of wire connected to port)" or,
- (ii) By position, placing the ports in exactly the same positions in the port lists of both the template and the instance.

In Figure 3, an example of module instantiation is given by connecting the submodule "control.v" to the main module/top module "datapath.v".



Figure 3: Module instantiation

The rest of the module instantiations are done in a similar way as shown in Figure 4.

```
control C0
(.inst_cntrl(inst[31:21]),.wen_cntrl(wen),.aluop_cntrl(aluop));
regfile RF0 (.clk(clk), .rst(rst), .wen(wen), .raddr1(inst[9:5]),
.raddr2(inst[20:16]), .waddr(inst[4:0]), .wdata(aluout),
.rdata1(rdata1), .rdata2(rdata2));
alu ALUO (.a(rdata1), .b(rdata2), .op(aluop), .out(aluout));
```

Figure 4: Module instantiation of all the four submodules



Note that the opcode encoding is given in the define.v file, and should not be modified. NREG (number of registers) = 32, DSIZE (bit-width of each register) = 64 and ASIZE (address size) = 5.

- Test the datapath implementation by creating a test bench as learned in the Lab 1. After you finished the main code, you need to check the bit-width of the inputs and outputs, add the clock and the test inputs as shown below:
  - a. <u>Check the bit-width of input "inst"</u>, as well as the bit-width of output "aluout". <u>Make their bit-width in accordance with "define.v" file (ISIZE=32, DSIZE=64 and ASIZE=5).</u>
  - b. Inserting CLK signal: before the "initial" statement, insert:

```
Verilog \( \text{always #15 clk = \( \sigma \text{clk}; \)
```

c. Add the following test vectors on the portion "//Add the stimulus":

```
#10 rst = 1;

#50 rst = 0;

#50; //wait 50 ns for global reset to finish

inst=32'b00000000000010000000000000000100011; // ADD X3, X1, X4

// ISIZE =32, DSIZE=64 bit and write and read ADDR= 5 bit.

// In this example opcode = 0, Rn = 1, Rm = 4 and Rd = 3. The 'shamt' filed is made zero.

#100; inst=32'b00000000001000100000000000100100;//SUB X4, X1, X2
```

d. More 32-bit instructions can be added to fully test the datapath below the earlier instructions.

```
#100; inst= 32'h00040023; // ADD X3, X1, X4 (already written above - shown in hex format) #100; inst= 32'h00220024; // SUB X4, X1, X2 (already written above - shown in hex format) #100 inst=32'h00440027; // AND X7, X1, X4 #100 inst=32'h006200E8; // XOR X8, X7, X2
```

- You can use <a href="https://personal.ntu.edu.sg/smitha/OPCoder/OPCoder/converter.html">https://personal.ntu.edu.sg/smitha/OPCoder/OPCoder/converter.html</a> to convert the ARM instructions to machine code in binary or hexadecimal number system.

# **EVALUATION - 2**

1) Complete the Verilog code for the full datapath which includes (RF, control and ALU) and verify its operation.

(Please make sure to complete this task but it does NOT need to be included in the last page of this manual, i.e. the lab report.)

## ADDITIONAL INFORMATION

Note that as we keep on increasing the number of registers and the bit width of the operands, the LUTs consumed by the architecture keeps on growing. This creates a problem for hardware engineers to



# SC3050 Advanced Computer Architecture

incorporate higher bit width as well as complex architectures onto FPGA. The designers of FPGAs used dedicated circuits called "hard macros", which can be used for the implementation of arithmetic circuits in less area. One of these is "DSP" that we saw in Lab 1. Consider the case of Register file. If we keep on increasing the DSIZE and NREG the LUT size becomes large. This can be handled by using a hard macro called a "BRAM". Note that hard macros cannot be optimized further by the synthesis tools.

| SC3050 | Advanced | Computer | Architecture |
|--------|----------|----------|--------------|
|        |          |          |              |

| Lab Report for Lab 2 (Name: , Matric: , Group: |
|------------------------------------------------|
|------------------------------------------------|

# **EVALUATION**

Please fill in the two tables based on the experiments.

| No. of Registers | Bit-width of the | No of reg | ister No of LUT slices | Minimum clock  |
|------------------|------------------|-----------|------------------------|----------------|
| (NREG)           | register (DSIZE) | slices    |                        | Period (in ns) |
| 4                | 64               |           |                        |                |
| 8                | 64               |           |                        |                |
| 16               | 64               |           |                        |                |
| 32               | 64               |           |                        |                |
| 64               | 64               |           |                        |                |

Table 1: LUT consumption and delay for behavioral register file vs bit-width

| Bit-width of the | No. of registers | No of  | register | No of LUT slices | Minimum clock  |
|------------------|------------------|--------|----------|------------------|----------------|
| register (DSIZE) | (NREG)           | slices |          |                  | Period (in ns) |
| 4                | 32               |        |          |                  |                |
| 8                | 32               |        |          |                  |                |
| 16               | 32               |        |          |                  |                |
| 32               | 32               |        |          |                  |                |
| 64               | 32               |        |          |                  |                |

Table 2: LUT consumption and delay for behavioral register file vs number of registers

## CHALLENGE (Optional, no marks)

- 1) Plot the graph, area in LUT slices (vs) No. of registers (NREG) as well as delay (vs) No. of registers (NREG) for the register file module for NREG = 4, 8, 16, 32, 64. Set DSIZE (bit-width of each register) = 64.
- 2) Plot the graph, area in LUT slices (vs) bit-width of register (DSIZE) as well as delay (vs) bit-width of register (DSIZE) for the register file module for DSIZE = 4, 8, 16, 32, 64. Set NREG (number of registers) = 32.



