# **CSE320 Final Exam Practice Questions**

### Single-Cycle Datapath/ Multi-Cycle Datapath Adding instructions

Modify the datapath and control signals to perform the new instructions in the corresponding datapath. Use the minimal amount of additional hardware and clock cycles/control states.

#### Remember:

- When adding new instructions, don't break the operation of the standard ones.
- Avoid adding ALUs, adders, Reg Files, or memories to the datapath
- You can add MUXes, logic gates, etc. but try to do minimally. (these cost in terms of area, cycle time, etc)
- a. Load Word Register (uses R instruction format)lwr Rt, Rd (Rs) #Reg[Rt] = Mem[Reg[Rd]+Reg[Rs]]
- b. Add 3 operands (new instruction format: opcode(6), rs(5), rt(5), rd(5), rx(5), (6 bits not used)) add3 Rd, Rs, Rt, Rx #Reg[Rd] = Reg[Rs] + Reg[Rt] + Reg[Rx]
- c. Add to Memory (new instruction format: opcode(6), rs(5), rt(5), rd(5), offset(11)) addm Rd, Rt, Offset(Rs) #Reg[Rd] = Reg[Rt] + Mem[sign extended offset + Reg[Rs]]
- d. Branch on less than or Equal (uses I instruction format)blez Rs, label # if Reg[Rs] < 0, PC = PC+4 + (sign-extended offset << 2)</li>
- e. Branch Equal to Memory (new instruction format: opcode(6), rs(5), rt(5), rd(5), offset(11)) beqm Rd, Rt, Offset(Rs) # if Reg[Rt] = Mem[Offset+Reg[Rs]], PC = PC + 4 + Reg[Rd]
- f. Branch Equal to 0 to Immediate (uses R instruction format) beqzi (Rs), Label #if Mem[Reg[Rs]] = 0, then PC = PC + (sign-extended offset) (NOTE: This is not PC+4, and not shifted by 2)
- g. Store Word and Increment swinc Rt, offset(Rs) #Mem[Reg[Rs] + sign extended offset] = Reg[Rt], Reg[Rs] = Reg[Rs] + 4
- h. Store Word and Decrement swdec Rt, offset(Rs) #Mem[Reg[Rs] + sign extended offset] = Reg[Rt], Reg[Rs] = Reg[Rs] 4

What if you were to add (g) and (h) simultaneously to the datapaths?

### **Datapath Timing**

- 1. Calculate the delay in the modified datapaths when performing instructions above. Assume the following delays:
  - Memory: 200ps
  - Register Files Access (READ/Write): 50ps
  - ALU and adders: 100ps
  - Logic Gates and Multiplexors: 1ps
  - All other times are negligible
- 2. Calculate the minimal clock cycle time if all of the new instructions were added in the Single and Multicycle cases.

### Other Datapath Questions

# Given MIPS code, can you determine.....

- Whatis happening at clock cycle X in the Single Cycle Datapath? Or what cycle is operation X happening?
- What is happening at clock cycle X in the Multi Cycle Datapath? Or what cycle is operation X happening?
- How many cycles it will take to execute the code?
- Can you identify the signals (control and values) in the datapath for a given clock cycle?
- And other questions of this nature....

## **Short Answer Misc Questions**

- 1. What is the primary advantage of fixed-sized opcodes?
- 2. Will a speedup of 20 on 50% of a program result in an overall speedup of at least 2 times? Explain your answer
- 3. What are the 5 components of a modern computer system (Hint: Two of them can be combined and called the processor)
- 4. What is a stored program computer?
- 5. True or False:
  - Program execution time increase when the instruction count increase (IC)
  - In a load/store architecture, the only instructions that access memory are load and store types.
  - More powerful instructions lead to higher performance since the total number of instructions executed is smaller for a given task with more powerful instructions.
  - An add operation has 3 operands (2 input and 1 output), therefore add instructions must be 3-address instructions.
- 6. In a system executing jobs, when is throughput = 1/latency?
- 7. What are the advantages and disadvantages of write-through and write-back cache modifications in shared-memory systems?

## **Pipelining**

- 1. What are the main benefits and disadvantages of pipeliing?
- 2. Name the type of pipelining hazards. Define how and when they can occur in systems (in general). Define how/when they occur in MIPS. Give a MIPS datapath or code example of each type.
- 3. Many processors have 5 or 6 stage pipelines. A typical value for the CPI (cycles per instruction) in such processors is in the range of 1.0 to 1.5. Does it mean that the *latency* of execution of most instruction 1 or 2 clock cycles? Why, or why not?
- 4. Why do conditional branches impact the performance of a pipelined implementation?
- 5. Briefly describe 2 solutions to reduce the performance impact of conditional branch instructions in a pipelined implementation.
- 6. Given sequences of instructions determine the forwarding paths and required stalls

#### Performance

1. The computer spends 82% of the time computing and 18% waiting for the disk. The instruction mix and the average cycles per instruction (CPI) for each type is:

| Type  | Instruction % | CP |
|-------|---------------|----|
| int   | 40%           | 1  |
| FP    | 30%           | 5  |
| Other | 30%           | 2  |

- a. Consider 3 modifications to the computer. Compute the speedup for each.
  - i. The processor is replaced with a new one that reduces the total computation time by 35%.
  - ii. The disk is replaced with a solid state device that reduces the disk waiting time by 85%.
  - iii. The processor is replaced with a new one that has improved floating point performance. The average floating point CPI is reduced to 3; all other aspects are unchanged.
- b. Which modification gave the best speedup?
- c. For the two modifications in part (i) that did not result in the best speedup, is it possible for them to achieve the speedup achieved by the modification in part (ii)? Show your work and explain your answer.
- You have two RiSC-16 processors X and Z, with the following characteristics. They are both multi-cycle processors, in which an instruction executes in a variable number of processor cycles. X and Z execute variations on the same instruction set (RiSC) is the following way:
  - a. Processor X implements the base instruction set, including LUI. Processor X implements multiplication in software, meaning there is not MULT instruction.
  - b. Processor Z eliminates the **LUI** instruction in favor of a **MULT** instruction, getting **LUI** functionality from **LW**.
  - c. Processor Z's **MULT** instruction uses the ALU over & over again in a loop, performing shifts and conditional adds, and requires 80 processor cycles per multiply.
  - d. Executing one MULT instruction on Processor Z eliminates on average 30 instructions that would be executed on Processor X when implemented in a software. However, Processor Z then need additional ALU functionality which increasing the ALU's critical path from 10ns to 12ns.

Also, Assume the following:

- Cache read/write: 10 nsRegister file read/write: 8 ns
- ALU operation: 10 ns for processor X, 12 ns for processor Z

Assume the following distribution of instruction types (assume that **LUI** requires 3 cycles):

|      | Processor X | Processor Z |  |
|------|-------------|-------------|--|
| MULT | 0%          | 5%          |  |

| LUI    | 5%  | 0%  |  |
|--------|-----|-----|--|
| LW     | 20% | 25% |  |
| SW     | 10% | 10% |  |
| R-Type | 45% | 40% |  |
| BEQ    | 20% | 20% |  |

For example, if processor Z executes 5 MULT instructions out of every 100. For each MULT instruction, processor X executes an additional 30 instructions.

- a. Compare the execution times of the two processors.
- b. At what clock speed for processor Z are the two designs equal in performance?
- c. Assuming the original ALU latency for processor Z (12 ns), how fast would your software emulated multiply have to be (on average) for processor X to be just as fast as processor Z? In other words, how many instructions would processor X execute in place of 1 MUL?
- 3. Two important parameters control the performance of a processor: cycle time and cycles per instruction. There is an enduring trade-off between these two parameters in the design process of microprocessors. While some designers prefer to increase the processor frequency at the expense of large CPI, other designers follow a different school of thought in which reducing the CPI comes at the expense of lower processor frequency. Consider the following machines, and compare their performance using the following instruction mix: 25% loads, 13% stores, 47% ALU instructions, and 15% branches/jumps. Assume the unmodified multi-cycle datapath and finite state machine.
  - M1: The multicycle datapath is designed with a 1 GHz clock
  - M2: A machine like M1 except that register updates are done in the same clock cycle as a memory read of ALU operation. Thus in the finite state machine, states 6 and 7 and states 3 and 4 are combined. This machine has an 3.2 GHz clock, since the register update increases the length of the critical path.
  - M3: A machine like M2 except that effective address calculations are done in the same clock cycle as a memory access. Thus states 2, 3, and 4 can be combined, as can 2 and 5, as well as 6 and 7. This machine has a 2.8 GHz clock because of the long cycle created by combining address calculation and memory access.

Find out which of the machines is fastest. Are there instruction mixes that would make another machine faster, and if so, what are they?

### Review Adder and ALU Creation and Building larger ALUs from units

Consider the 4-bit ALU below which can perform the following 5 operations: add, sub, AND, OR and negate B.

Inputs are  $A=\{A_3,A_2,A_1,A_0\}$ ,  $B=\{B_3,B_2,B_1,B_0\}$ , and  $C_{in}$ . Outputs are result  $R=\{R_3,R_2,R_1,R_0\}$  and  $C_{out}$ . Numbers are in 2's complement form. Fill in the table below, for each operation, what the values of the control signals should be. Indicate don't cares where appropriate.



| Operation | m <sub>1</sub> | M <sub>0</sub> | C <sub>in</sub> | B <sub>INV</sub> | Az |
|-----------|----------------|----------------|-----------------|------------------|----|
| Add       |                |                |                 |                  |    |
| Sub       |                |                |                 |                  |    |
| OR        |                |                |                 |                  |    |
| AND       |                |                |                 |                  |    |
| Negate B  |                |                |                 |                  |    |

## **Digital Logic**

- 1. Using Boolean algebra, prove the following:
  - a. bd' + c'd = ((b'd') + (cd))'
  - b. abc' + bc'd + a'bd = abc' + a'bd
  - c. a' + a(a'b + b'c)' = a' + b + c'
- 2. Consider the following function:  $z(x_3,x_2,x_1,x_0) = x_3'x_2 + x_3x_1x_0 + x_3x_2x_0' + x_3'x_2x_0$ 
  - a. How many literals does z contain?
  - b. Is z, minimal? If not, find the minimal expression using Boolean algebra.
  - c. Find the equivalent sum of minterms (SOP) for z (using  $\Sigma$ m notation)
  - d. Find the equivalent product of maxterms (POS) for z (using  $\Pi$ M notation)
- 3. You were recently hired as an engineer in a company that designs alarm systems custom made to meet the customer's specifications. You are asked to design a system that uses the inputs of three sensors A, B, and C. The alarm should go off (activated) when the following criteria are met:
  - When A is off, or
  - When B is on and C is off, or
  - When both A and C are on.
  - a. Write the truth table for the function
  - b. Write the Boolean expression in Product of Sums (POS) form.
  - c. Draw/Implement the function using 2-selector MUX gates.
  - d. Draw/Implement the function using 2-level NAND-NAND gates.

- e. Draw/Implement the function using a 2-bit decoder and any other gates
- 4. Find the minimal 2-level implementation using NOR-NOR gates, of a system with two 2-bit inputs (A =  $\{a_1, a_0\}$  & B =  $\{b_1, b_0\}$ ) which output the following. If A+B is even, then the output is their product. If A+B is odd, then the output is their sum.
- 5. Implement the functionality of a 2-input Decoder using minimal AND, OR and NOT gates.
- 6. Implement a 7-segment Controller.(truth table, Boolean expressions, gate logic). Practice with any combination of logic units.



7. Implement the 7-segment using a 4-selector DEMUX and Or gates.