# **EECS151: Introduction to Digital Design and ICs**

# Lecture 19 – Multipliers, Shifters

## **Bora Nikolić**



October 28, 2021, EETimes - The quickening pace of satellite launches into low-earth orbit for applications such as global internet coverage is creating a growing space congestion and debris problem. Satellite operators are now required to include additional propellent to de-orbit satellites once their lifetime expires.

Still, decades-worth of upper stages used to push payloads to intended orbits continue to coast through the solar system. In one recent example, the expended thi stage of the Apollo Saturn V—most likely from Apollo 12 launched in October 1969—showed up near Earth in its endless heliocentric orbit.





#### Review

- Adders
  - Carry is in the adder critical path
  - Mirror adders cells are commonly found in libraries
  - Ripple-carry adder is the least complex, lowest energy
  - Carry-bypass, carry-select are usually faster than ripple-carry for bitwidths > 8
- Multipliers
  - Shift-and-add is the most compact

#### Administrivia

- Homework 7 due this week
- Homework 8 due next week
  - In scope for midterm
- All labs need to be checked off by this week!
- Projects (ASIC and FPGA) started, first check point this week
- Midterm 2 is on November 4 at 7pm
  - Review session tonight at 7pm



Multipliers

4 Berkeley © 30
BY NC SA
BY NC SA

EECS151 L19 MULTIPLIERS
Nikolić Fall 2021

## "Shift and Add" Multiplier

### Signed Multiplication:

Remember for 2's complement numbers MSB has negative weight:

$$X = \sum_{i=0}^{N-2} x_i 2^i - x_{n-1} 2^{n-1}$$

ex: 
$$-6 = 11010_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 1 \cdot 2^3 - 1 \cdot 2^4$$
  
= 0 + 2 + 0 + 8 - 16 = -6

- Therefore for multiplication:
  - a) subtract final partial product
  - b) sign-extend partial products
- Modifications to shift & add circuit:
  - a) adder/subtractor
  - b) sign-extender on P shifter register

## Convince yourself

• What's -3 x 5?

1101 x 0101





Unsigned Parallel Multiplier

7 Berkeley 6 9 6 BY NC SA

Nikolić Fall 2021

### Parallel (Array) Multiplier



Performance: What is the critical path?



multiplicand

Partial products, one for each bit in multiplier

## Parallel (Array) Multiplier

Single cycle multiply: Generates all n partial products simultaneously.



## Carry-Save Addition

 Speeding up multiplication is a matter of speeding up the summing of the partial products.

- "Carry-save" addition can help.
- Carry-save addition passes (saves) the carries to the output, rather than propagating them.
   carry-save add

carry-propagate add «

Example: sum three numbers,

 $c 0010 = 2_{10}$ 

 $s 0110 = 6_{10}$ 

 $1000 = 8_{10}$ 

$$3_{10} = 0011, 2_{10} = 0010, 3_{10} = 0011$$

$$\begin{pmatrix}
3_{10} & 0011 \\
+ & 2_{10} & 0010 \\
c & 0100 & = & 4_{10} \\
s & 0001 & = & 1_{10}
\end{pmatrix}$$
 carry-save add

- In general, carry-save addition takes in 3 numbers and produces 2: "3:2 compressor";
- Whereas, carry-propagate takes 2 and produces 1.
- With this technique, we can avoid carry propagation until final addition

### **Carry-Save Circuits**



- When adding sets of numbers, carry-save can be used on all but the final sum.
- Standard adder (carry propagate) is used for final sum.
- Carry-save is fast (no carry propagation) and inexpensive (full adders)



### Array Multiplier Using Carry-Save Addition



What is the critical path?

 $a_i$ 

in

## Carry-Save Addition

#### CSA is associative and commutative. For example:

$$(((X_0 + X_1) + X_2) + X_3) = ((X_0 + X_1) + (X_2 + X_3))$$



- A balanced tree can be used to reduce the logic delay
- It doesn't matter where you add the carries and sums, as long as you eventually do add them
- This structure is the basis of the Wallace Tree Multiplier
- Partial products are summed with the CSA tree. Fast adder (ex: CLA) is used for final sum
- Multiplier delay  $\alpha \log_{3/2}N + \log_2N$

## Wallace-Tree Multiplier

Reduce the partial products in logic stages – 4 x 4 example

Partial products

(a)

First stage

Bit position

(b)

Second stage



Final adder



(d)

#### Wallace-Tree Multiplier



Note: Wallace tree is often slower than an array multiplier in FPGAs (which have optimized carry chains)

### Increasing Throughput: Pipelining

- ullet Multipliers have a long critical path: PP generation ullet reduction tree ullet final adder
  - Often pipelined before final adder (2x flip-flops for carry-save)



Berkeley @08



# **Booth Recoding**

Berkeley © S O BY NC SA

EECS151 L19 MULTIPLIERS

Nikolić Fall 2021 17 Berke

### **Booth Recoding: Motivation**

$$a_{N-1}...$$
  $a_2$   $a_1$   $a_0$   $\leftarrow$  Multiplicand  $b_{N-1}...$   $b_2$   $b_1$   $b_0$   $\leftarrow$  Multiplier

$$a_{N-1}b_0 \dots \ a_2b_0 \quad a_1b_0 \quad a_0b_0$$
 
$$a_{N-1}b_1 \dots \ a_2b_1 \quad a_1b_1 \quad a_0b_1$$
 
$$a_{N-1}b_2 \dots \ a_2b_2 \quad a_1b_2 \quad a_0b_2$$
 
$$a_{N-1}b_3 \dots \ a_2b_3 \quad a_1b_3 \quad a_0b_3$$

N partial products (x {0, 1}

$$a_1b_0+a_0b_1$$
  $a_0b_0 \leftarrow Product$ 

How many non-zero partial products (out of N)?

- N, if B = 000...0
- 0, if B = 111...1
- N/2 on the average

### Booth Recoding: Main Idea

- Encode ...0111100... patterns:
  - $\bullet$  1111 =  $2^3 + 2^2 + 2^1 + 2^0 = 2^4 2^0$
  - Only two non-zero numbers, but needs to represent +1 and -1
- Encoding method:
  - Encode pairs of bits, by looking at a window of three bits, from LSB
    - 000 is a middle of string of 0's
    - 001, 011 are the beginnings of a string of 1's
    - 010 is an isolated 1
    - 100, 110 are ends of a string of 1's
    - 101 is the end of one string of 1's and the beginning of the next
    - 111 is the middle of a string of 1's
  - Worst case: ...010101... exactly a half of non-zero partial products

## Booth Recoding: Higher-radix multiplier

Idea: If we could use, say, 2 bits of the multiplier in generating each partial product we would halve the number of columns and speed it up!



Booth's insight: rewrite 2\*A and 3\*A cases, leave 4A for next partial product to do!

$$B_{K+1,K}^*A = 0^*A \rightarrow 0$$

$$= 1^*A \rightarrow A$$

$$= 2^*A \rightarrow 2A \text{ (or } 4A - 2A)$$

$$= 3^*A \rightarrow 4A - A$$

## Booth recoding

(On-the-fly canonical signed digit encoding!)

current bit pair

from previous bit pair

| _         |         |           |         |          |                |
|-----------|---------|-----------|---------|----------|----------------|
| $B_{K+1}$ | $B_{K}$ | $B_{K-1}$ | action  |          | D *A           |
| 0         | 0       | 0         | add 0   |          | $B_{K+1,K}^*A$ |
| 0         | 0       | 1         | add A   |          |                |
| 0         | 1       | 0         | add A   |          |                |
| 0         | 1       | 1         | add 2*A |          |                |
| 1         | 0       | 0         | sub 2*A |          |                |
| 1         | 0       | 1         | sub A   | <b>←</b> | -2*A+A         |
| 1         | 1       | 0         | sub A   |          |                |
| 1         | 1       | 1         | add 0   | ←        |                |
|           |         |           |         |          |                |

 $= 0*A \rightarrow 0$ 

 $= 1*A \rightarrow A$ 

 $= 2*A \rightarrow 2A$ 

 $= 3*A \rightarrow 4A - A$ 

## Example

Compression tree needs to support subtraction

| 0111          |       | A   |
|---------------|-------|-----|
| <b>x</b> 1010 |       | В   |
|               |       |     |
| -01110        | 10(0) | -2A |
| -00111        | 101   | -A  |
| +0111         | 001   | +A  |
|               |       |     |
| 01000110      |       |     |



A Walther WSR160 arithmometer (from Wikipedia)

EECS151 L19 MULTIPLIERS Nikolić Fall 2021 22 Berk

### **Booth Recoding Notes**

- Key advantage: Reduces the number of partial products
  - Compression tree depth becomes  $log_{3/2}[N/2]$
  - Partial product generation is slightly more complex than a NAND2
- Useful for larger multipliers
  - And some very creative solutions for repeated multiplications (FIR filters, etc)



# Signed Multipliers

24 Berkeley © © © O SA

EECS151 L19 MULTIPLIERS

Nikolić Fall 2021 24 Berkele

## Signed Array Multiplier

Two's complement



$$(-2)$$

0 0 0 1 1 0

(-3) \* (-2)

### Combinational Multiplier (signed)



## 2's Complement Multiplication (Baugh-Wooley)

Step 1: two's complement operands so high order bit is -2N-1. Must sign extend partial products and subtract the last one

Step 2: don't want all those extra additions, so add a carefully chosen constant, remembering to subtract it at the end. Convert subtraction into add of (complement + 1).

```
X3Y0 X3Y0 X3Y0 X3Y0 X3Y0 X2Y0 X1Y0 X0Y0
+ X3Y1 X3Y1 X3Y1 X3Y1 X2Y1 X1Y1 X0Y1
+ X3Y2 X3Y2 X3Y2 X2Y2 X1Y2 X0Y2
+ X3Y3 X3Y3 X2Y3 X1Y3 X0Y3
```

Step 3: add the ones to the partial products and propagate the carries. All the sign extension bits go away!

```
X3Y0 X2Y0 X1Y0 X0Y0
          X3Y1 X2Y1 X1Y1 X0Y1
     X2Y2 X1Y2 X0Y2
X3Y3 X2Y3 X1Y3 X0Y3
```

Step 4: finish computing the constants...

```
X3Y0 X2Y0 X1Y0 X0Y0
          X3Y1 X2Y1 X1Y1 X0Y1
     X2Y2 X1Y2 X0Y2
X3Y3 X2Y3 X1Y3 X0Y3
```

Result: multiplying 2's complement operands takes just about same amount of hardware as multiplying unsigned onerandal

## 2's Complement Multiplication (Baugh-Wooley)



### Multiplication in Verilog

You can use the "\*" operator to multiply two numbers:

```
wire [9:0] a,b;
wire [19:0] result = a*b; // unsigned multiplication!
```

If you want Verilog to treat your operands as signed two's complement numbers, add the keyword signed to your wire or reg declaration:

```
wire signed [9:0] a,b;
wire signed [19:0] result = a*b; // signed multiplication!
```

Remember: unlike addition and subtraction, you need different circuitry if your multiplication operands are signed vs. unsigned. Same is true of the >>> (arithmetic right shift) operator. To get signed operations all operands must be signed.

```
wire signed [9:0] a;
wire [9:0] b;
wire signed [19:0] result = a*$signed(b);
```

To make a signed constant: 10'sh37C





Multiplication with a Constant

30 Berkeley © \$0

Nikolić Fall 2021

## Constant Multiplication

- Our multiplier circuits so far has assumed both the multiplicand (A) and the multiplier (B) can vary at runtime.
- What if one of the two is a constant?

$$Y = C * X$$

"Constant Coefficient" multiplication comes up often in signal processing. Ex:

$$y_i = \alpha y_{i-1} + x_i$$



where  $\alpha$  is an application-dependent constant that is hard-wired into the circuit.

How do we build and array style (combinational) multiplier that takes advantage of a constant operand?

#### Multiplication by a Constant

- If the constant C in C\*X is a power of 2, then the multiplication is simply a shift of X.
- Ex: 4\*X



- What about division?
- What about multiplication by non- powers of 2?

### Multiplication by a Constant

• In general, a combination of fixed shifts and addition:

• Ex: 
$$6*X = 0110 * X = (2^2 + 2^1)*X = 2^2 X + 2^1 X$$



Details:



## Multiplication by a Constant

• Another example:  $C = 23_{10} = 010111$ 



- In general, the number of additions equals one less than the number of 1's in the constant.
- Using carry-save adders (for all but one of these) helps reduce the delay and cost, and using trees helps with delay, but the number of adders is still the number of 1's in C minus 2.

• Is there a way to further reduce the number of adders (and thus the cost and delay)?

4 Berkeley © S

## Multiplication using Subtraction

- Subtraction is the same cost and delay as addition.
- Consider C\*X where C is the constant value  $15_{10} = 01111$ . C\*X requires 3 additions.
- We can "recode" 15

from 
$$011111 = (2^3 + 2^2 + 2^1 + 2^0)$$
  
to  $10001 = (2^4 - \overline{2}^0)$ 

where  $\overline{1}$  means negative weight.

- Therefore, 15\*X can be implemented with only one subtractor.
  - Remember Booth encoding



## Canonic Signed Digit Representation

- CSD represents numbers using 1, 1, & 0 with the least possible number of non-zero digits.
  - Strings of 2 or more non-zero digits are replaced with a 1000..1.
  - Leads to a unique representation.
- To form CSD representation might take 2 passes:
  - First pass: replace all occurrences of 2 or more 1's:

01..10 by 
$$10..\overline{10}$$

- Second pass: same as above, plus replace 0110 with 0010 and  $0\overline{1}10$  with  $00\overline{1}0$
- **Examples:** 011101  $100\overline{1}01 = 32 - 4 + 1$

$$0010111 = 23$$

$$0011001$$

$$0101001 = 32 - 8 - 1$$

$$0110110 = 54$$

$$10\overline{1}10\overline{1}0$$

$$100\overline{1}0\overline{1}0 = 64 - 8 - 2$$

Can we further simplify the multiplier circuits?

## (K) Constant Coefficient Multiplication (KCM)

Binary multiplier:  $Y = 231*X = (2^7 + 2^6 + 2^5 + 2^2 + 2^1 + 2^0)*X$ 



- CSD helps, but the multipliers are limited to shifts followed by adds.
  - CSD multiplier:  $Y = 231*X = (2^8 2^5 + 2^3 2^0)*X$



- How about shift/add/shift/add ...?
  - KCM multiplier:  $Y = 231*X = 7*33*X = (2^3 2^0)*(2^5 + 2^0)*X$



- No simple algorithm exists to determine the optimal KCM representation.
- Most use exhaustive search method.

http://www.andraka.com/multipli.php



## **Shifters and Rotators**

Berkeley © © © © BY NC SA

EECS151 L19 MULTIPLIERS

Nikolić Fall 2021 38

# Fixed Shifters / Rotators Defined



## Variable Shifters / Rotators

- Example: X >> S, where S is unknown at design time.
- Uses: Shift instruction in processors, floating-point arithmetic, division/multiplication by powers of 2, etc.
- One way to build this is a simple shift-register:
  - a) Load word, b) shift enable for S cycles, c) read word.



- Worst case delay O(N), not good for processor design.
- Can we do it in O(logN) time and fit it in one cycle?

Nikolić Fall 2021

### Log Shifter / Rotator

• Log(N) stages, each shifts (or not) by a power of 2 places,  $S=[s_2;s_1;s_0]$ :



## LUT Mapping of Log shifter



Efficient with 2-to-1 multiplexers, for instance, 3LUTs.

Virtex6 has 6LUTs. Naturally makes 4-to-1 muxes:

Reorganize shifter to use 4to1 muxes.





## "Improved" Shifter / Rotator



- Requires careful (custom) circuit design:
  - High fanout on input signals  $x_0...x_7$
  - Large multiplexers are slow

Nikolić Fall 2021

## **Barrel Shifter**

- Cost/delay?
  - (don't forget the decoder)



### Crossbar Switch

- Nlog(N) control signals.
- Supports all interesting permutations
  - All one-to-one and one-to-many connections.
- Processors to memory/peripherals
- Communication hardware (switches, routers).





Western Electric 100-point six-wire Type B crossbar switch (Wikipedia)

#### Review

- Binary multipliers have three blocks:
  - Partial-product generation (NAND or Booth)
  - Partial-product compression (ripple-carry array, CSA or Wallace)
  - Final adder
- Multipliers are often pipelined
- Constant multipliers can be optimized for size/speed
- Shifters and crossbars are common building blocks in digital systems
  - Often require customization