

#### COMPUTER ORGANIZATION AND DESIGN

The Hardware/Software Interface



## **Chapter 1**

# Computer Abstractions and Technology

## **The Computer Revolution**

- Progress in computer technology
  - Underpinned by Moore's Law
- Makes novel applications feasible
  - Computers in automobiles
  - Cell phones
  - Human genome project
  - World Wide Web
  - Search Engines
- Computers are pervasive



## **Classes of Computers**

- Personal computers
  - General purpose, variety of software
  - Subject to cost/performance tradeoff
- Server computers
  - Network based
  - High capacity, performance, reliability
  - Range from small servers to building-sized installations

- Supercomputers
  - Type of server
  - High-end scientific and engineering calculations
  - Highest capability but represent a small fraction of the overall computer market
- Embedded computers
  - Hidden as components of systems
  - Stringent power/performance/cost constraints



### The PostPC Era



- Personal Mobile Device (PMD)
  - Battery operated
  - Connects to the Internet
  - Hundreds of dollars
  - Smart phones, tablets, electronic glasses
- Cloud computing
  - Warehouse Scale Computers (WSC)
  - Software as a Service (SaaS)
  - A portion of the software runs on a PMD and a portion runs in the Cloud
  - Amazon and Google



### What You Will Learn

- How programs are translated into the machine language
  - And how the hardware executes them
- The hardware/software interface
- What determines program performance
  - And how it can be improved
- How hardware designers improve performance
- What is parallel processing



## **Understanding Performance**

- Algorithm
  - Determines number of operations executed
- Programming language, compiler, architecture
  - Determine number of machine instructions executed per operation
- Processor and memory system
  - Determine how fast instructions are executed
- I/O system (including OS)
  - Determines how fast I/O operations are executed



### **Seven Great Ideas**

- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy

















## **Below Your Program**



- Application software
  - Written in high-level language
- System software
  - Compiler & Assembler: translate HLL code to machine code
  - Operating System: service code
    - Handling input/output
    - Managing memory and storage
    - Scheduling tasks & sharing resources
- Hardware
  - Processor, memory, I/O controllers



## **Levels of Program Code**

- High-level language
  - Level of abstraction closer to problem domain
  - Provides for productivity and portability
- Assembly language
  - Textual representation of instructions
- Hardware representation
  - Binary digits (bits)
  - Encoded instructions and data

High-level language program (in C):

    void swap(int v[], int k)
    {
        int temp;
        temp = v[k];
        v[k] = v[k+1];
        v[k+1] = temp;
    }

The compiler translates this into an assembly language program (for MIPS):

    swap:
        muli $2, $5, 4
        add  $2, $4, $2
        lw   $15, 0($2)
        lw   $16, 4($2)
        sw   $16, 0($2)
        sw   $15, 4($2)
        jr   $31

The assembler then translates the assembly program into a binary machine language program (for MIPS).

## Components of a Computer

#### **The BIG Picture**







- Same components for all kinds of computer
  - Desktop, server, embedded
- Input/output includes
  - User-interface devices
    - Display, keyboard, mouse
  - Storage devices
    - Hard disk, CD/DVD, flash
  - Network adapters
    - For communicating with other computers



## **Opening the Box**



## Inside the Processor (CPU)

- Datapath: performs operations on data
- Control: sequences datapath, memory, ...
- Cache memory
  - Small fast SRAM memory for immediate access to data

### **Inside the Processor**

A12 processor



## **Abstractions**

#### **The BIG Picture**

- Abstraction helps us deal with complexity
  - Hide lower-level detail
- Instruction set architecture (ISA)
  - The hardware/software interface
- Application binary interface
  - The ISA plus system software interface
- Implementation
  - The details underlying the interface



## **Technology Trends**

- Electronics technology continues to evolve
  - Increased capacity and performance
  - Reduced cost



DRAM capacity

| Year | Technology                 | Relative performance/cost |
|------|----------------------------|---------------------------|
| 1951 | Vacuum tube                | 1                         |
| 1965 | Transistor                 | 35                        |
| 1975 | Integrated circuit (IC)    | 900                       |
| 1995 | Very large scale IC (VLSI) | 2,400,000                 |
| 2013 | Ultra large scale IC       | 250,000,000,000           |



## Semiconductor Technology

- Silicon: semiconductor
- Add materials to transform properties:
  - Conductors
  - Insulators
  - Switch

## **Manufacturing ICs**



Yield: proportion of working dies per wafer



## Intel® Core 10th Gen



- 300mm wafer, 506 chips, 10nm technology
- Each chip is 11.4 x 10.7 mm



### Videos about Semiconductor Manufacturing

- Video 1: https://www.youtube.com/watch?v=bor0qLifjz4
- Video 2: https://www.youtube.com/watch?v=_VMYPLXnd7E
- Video 3: https://www.youtube.com/watch?v=vK-geBYygXo
- Video 4: https://www.youtube.com/watch?v=qm67wbB5Gml

## **Defining Performance**

#### Which airplane has the best performance?











## Response Time and Throughput

- Response time
  - How long it takes to do a task
- Throughput
  - Total work done per unit time
    - e.g., tasks/transactions/... per hour
- How are response time and throughput affected by
  - Replacing the processor with a faster version?
  - Adding more processors?
- We'll focus on response time for now...

### **Relative Performance**

- Define Performance = 1/Execution Time
- "X is $n$ times faster than Y" means

$$\frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = n$$

- Example: time taken to run a program
  - 10s on A, 15s on B
  - Execution Time$_B$ / Execution Time$_A$ = 15s / 10s = 1.5
  - So A is 1.5 times faster than B
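
The example above can be checked with a few lines of Python. This is a minimal sketch; the function name `relative_performance` is an illustrative choice, not something defined in these slides.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y.

    Performance is defined as 1 / execution time, so
    Performance_X / Performance_Y = time_y / time_x.
    """
    return time_y / time_x


# Example from the slide: 10 s on A, 15 s on B.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n} times faster than B")
```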



## **Measuring Execution Time**

- Elapsed time
  - Total response time, including all aspects
    - Processing, I/O, OS overhead, idle time
  - Determines system performance
- CPU time
  - Time spent processing a given job
    - Discounts I/O time, other jobs' shares
  - Comprises user CPU time and system CPU time
  - Different programs are affected differently by CPU and system performance



## **CPU Clocking**

- Operation of digital hardware is governed by a constant-rate clock
- Clock period: duration of a clock cycle
  - e.g., $250\,\text{ps} = 0.25\,\text{ns} = 250 \times 10^{-12}\,\text{s}$
- Clock frequency (rate): cycles per second
  - e.g., $4.0\,\text{GHz} = 4000\,\text{MHz} = 4.0 \times 10^9\,\text{Hz}$
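
Clock period and clock rate are reciprocals of each other, which is why the two example values above describe the same clock. A short sketch of the conversion (the helper names are hypothetical):

```python
def clock_rate_hz(period_s: float) -> float:
    """Clock rate in Hz for a given clock period in seconds."""
    return 1.0 / period_s


def clock_period_s(rate_hz: float) -> float:
    """Clock period in seconds for a given clock rate in Hz."""
    return 1.0 / rate_hz


# A 250 ps period corresponds to a rate of about 4.0 GHz, and vice versa.
rate = clock_rate_hz(250e-12)
period = clock_period_s(4.0e9)
print(f"{rate / 1e9:.1f} GHz, {period * 1e12:.0f} ps")
```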

### **CPU Time**

$$\text{CPU Time} = \text{CPU Clock Cycles} \times \text{Clock Cycle Time} = \frac{\text{CPU Clock Cycles}}{\text{Clock Rate}}$$

- Performance improved by
  - Reducing number of clock cycles
  - Increasing clock rate
- Hardware designers must often trade off clock rate against cycle count

## **CPU Time Example**

- Computer A: 2GHz clock, 10s CPU time
- Designing Computer B
  - Aim for 6s CPU time
  - Can use a faster clock, but this causes 1.2 × as many clock cycles
- How fast must Computer B clock be?

$$\text{Clock Rate}_B = \frac{\text{Clock Cycles}_B}{\text{CPU Time}_B} = \frac{1.2 \times \text{Clock Cycles}_A}{6\,\text{s}}$$

$$\text{Clock Cycles}_A = \text{CPU Time}_A \times \text{Clock Rate}_A = 10\,\text{s} \times 2\,\text{GHz} = 20 \times 10^9$$

$$\text{Clock Rate}_B = \frac{1.2 \times 20 \times 10^9}{6\,\text{s}} = \frac{24 \times 10^9}{6\,\text{s}} = 4\,\text{GHz}$$
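
The same calculation as a short Python sketch, using only the numbers given in the example. The variable names are illustrative.

```python
# Computer A: 2 GHz clock, 10 s CPU time.
clock_rate_a = 2e9   # Hz
cpu_time_a = 10.0    # seconds

# Clock cycles = CPU time x clock rate.
cycles_a = cpu_time_a * clock_rate_a

# Computer B needs 1.2x as many cycles and must finish in 6 s.
cycles_b = 1.2 * cycles_a
cpu_time_b = 6.0

# Clock rate = clock cycles / CPU time.
clock_rate_b = cycles_b / cpu_time_b
print(f"Computer B needs a {clock_rate_b / 1e9:.1f} GHz clock")
```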



### Instruction Count and CPI

$$\text{Clock Cycles} = \text{Instruction Count} \times \text{Cycles per Instruction}$$

$$\text{CPU Time} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}}$$

- Instruction Count for a program
  - Determined by program, ISA and compiler
- Average cycles per instruction
  - Determined by CPU hardware
  - If different instructions have different CPI
    - Average CPI affected by instruction mix



## **CPI Example**

- Computer A: Cycle Time = 250ps, CPI = 2.0
- Computer B: Cycle Time = 500ps, CPI = 1.2
- Same ISA
- Which is faster, and by how much?

$$\begin{aligned} \text{CPU Time}_A &= \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A \\ &= I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps} \end{aligned}$$

$$\begin{aligned} \text{CPU Time}_B &= \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B \\ &= I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps} \end{aligned}$$

A is faster, by this much:

$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
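
Because both computers run the same ISA, the instruction count $I$ is the same and cancels out of the ratio. The sketch below therefore compares time per instruction only (names are illustrative):

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time per instruction in picoseconds: CPI x cycle time."""
    return cpi * cycle_time_ps


time_a = time_per_instruction_ps(cpi=2.0, cycle_time_ps=250.0)
time_b = time_per_instruction_ps(cpi=1.2, cycle_time_ps=500.0)

# CPU time = I x time per instruction, so I cancels in the ratio.
speedup_a_over_b = time_b / time_a
print(f"A: {time_a:.0f} ps/instr, B: {time_b:.0f} ps/instr, "
      f"A is {speedup_a_over_b:.1f} times faster")
```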

## **CPI in More Detail**

- If different instruction classes take different numbers of cycles

$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$

- Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

- The ratio $\text{Instruction Count}_i / \text{Instruction Count}$ is the relative frequency of class $i$

## **CPI Example**

- Alternative compiled code sequences using instructions in classes A, B, C

| Class            | A | B | C |
|------------------|---|---|---|
| CPI for class    | 1 | 2 | 3 |
| IC in sequence 1 | 2 | 1 | 2 |
| IC in sequence 2 | 4 | 1 | 1 |

- Sequence 1: IC = 5
  - Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  - Avg. CPI = 10/5 = 2.0
- Sequence 2: IC = 6
  - Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  - Avg. CPI = 9/6 = 1.5
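
The two sequences can be worked through in code. This is a sketch of the weighted-average CPI formula using the table values; the dictionary layout and function names are assumptions made for illustration.

```python
# CPI for each instruction class, from the table.
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for each compiled sequence.
SEQUENCE_1 = {"A": 2, "B": 1, "C": 2}
SEQUENCE_2 = {"A": 4, "B": 1, "C": 1}


def clock_cycles(counts: dict) -> int:
    """Sum of CPI_i x instruction count_i over all classes."""
    return sum(CPI_BY_CLASS[cls] * n for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """Clock cycles divided by the total instruction count."""
    return clock_cycles(counts) / sum(counts.values())


for name, seq in (("Sequence 1", SEQUENCE_1), ("Sequence 2", SEQUENCE_2)):
    print(name, "cycles:", clock_cycles(seq), "avg CPI:", average_cpi(seq))
```

Sequence 2 executes more instructions but fewer clock cycles, so instruction count alone does not determine which sequence is faster.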

## **Performance Summary**

#### **The BIG Picture**

$$CPU \ Time = \frac{Instructions}{Program} \times \frac{Clock \ cycles}{Instruction} \times \frac{Seconds}{Clock \ cycle}$$

- Performance depends on
  - Algorithm: affects IC, possibly CPI
  - Programming language: affects IC, CPI
  - Compiler: affects IC, CPI
  - Instruction set architecture: affects IC, CPI, $T_c$



### **Power Trends**



- In CMOS IC technology

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^2 \times \text{Frequency}$$

- Figure annotations: power ×30, frequency ×1000
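
Since voltage enters the power equation squared, lowering it has an outsized effect. The sketch below applies the equation to a hypothetical new processor. The 0.85 scaling factors are assumed values chosen for illustration and do not come from these slides.

```python
def relative_power(capacitance_ratio: float,
                   voltage_ratio: float,
                   frequency_ratio: float) -> float:
    """New power / old power, from Power = C x V^2 x f."""
    return capacitance_ratio * voltage_ratio ** 2 * frequency_ratio


# Hypothetical: capacitive load, voltage, and frequency each scaled to 85%.
ratio = relative_power(0.85, 0.85, 0.85)
print(f"New design uses about {ratio:.2f} of the original power")
```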



## **Uniprocessor Performance**





## Multiprocessors

- Multicore microprocessors
  - More than one processor per chip
- Requires explicitly parallel programming
  - Compare with instruction level parallelism
    - Hardware executes multiple instructions at once
    - Hidden from the programmer
  - Hard to do
    - Programming for performance
    - Load balancing
    - Optimizing communication and synchronization



### **SPEC CPU Benchmark**

- Programs used to measure performance
  - Supposedly typical of actual workload
- Standard Performance Evaluation Corp (SPEC)
  - Develops benchmarks for CPU, I/O, Web, ...
- SPEC CPU2006
  - Elapsed time to execute a selection of programs
    - Negligible I/O, so focuses on CPU performance
  - Normalize relative to reference machine
  - Summarize as geometric mean of performance ratios
    - CINT2006 (integer) and CFP2006 (floating-point)



## SPECspeed 2017 Integer benchmarks on a 1.8 GHz Intel Xeon E5-2650L

| Description | Name | Instruction count (× 10^9) | CPI | Clock cycle time (seconds × 10^-9) | Execution time (seconds) | Reference time (seconds) | SPECratio |
|-------------|------|----------------------------|-----|------------------------------------|--------------------------|--------------------------|-----------|
| Perl interpreter | perlbench | 2684 | 0.42 | 0.556 | 627 | 1774 | 2.83 |
| GNU C compiler | gcc | 2322 | 0.67 | 0.556 | 863 | 3976 | 4.61 |
| Route planning | mcf | 1786 | 1.22 | 0.556 | 1215 | 4721 | 3.89 |
| Discrete event simulation - computer network | omnetpp | 1107 | 0.82 | 0.556 | 507 | 1630 | 3.21 |
| XML to HTML conversion via XSLT | xalancbmk | 1314 | 0.75 | 0.556 | 549 | 1417 | 2.58 |
| Video compression | x264 | 4488 | 0.32 | 0.556 | 813 | 1763 | 2.17 |
| Artificial Intelligence: alpha-beta tree search (Chess) | deepsjeng | 2216 | 0.57 | 0.556 | 698 | 1432 | 2.05 |
| Artificial Intelligence: Monte Carlo tree search (Go) | leela | 2236 | 0.79 | 0.556 | 987 | 1703 | 1.73 |
| Artificial Intelligence: recursive solution generator (Sudoku) | exchange2 | 6683 | 0.46 | 0.556 | 1718 | 2939 | 1.71 |
| General data compression | xz | 8533 | 1.32 | 0.556 | 6290 | 6182 | 0.98 |
| Geometric mean | | | | | | | 2.36 |
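
The table's columns are linked by the CPU time equation (execution time = instruction count × CPI × clock cycle time) and by SPECratio = reference time / execution time. The sketch below checks one row and recomputes the geometric mean from the execution and reference times listed in the table. The helper names are illustrative, and small differences from the table are rounding.

```python
import math

# (name, execution time in s, reference time in s), copied from the table.
RESULTS = [
    ("perlbench", 627, 1774), ("gcc", 863, 3976), ("mcf", 1215, 4721),
    ("omnetpp", 507, 1630), ("xalancbmk", 549, 1417), ("x264", 813, 1763),
    ("deepsjeng", 698, 1432), ("leela", 987, 1703),
    ("exchange2", 1718, 2939), ("xz", 6290, 6182),
]


def execution_time_s(ic_billions: float, cpi: float, cycle_ns: float) -> float:
    """IC x CPI x cycle time; the 1e9 and 1e-9 scale factors cancel."""
    return ic_billions * cpi * cycle_ns


def geometric_mean(values: list) -> float:
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# perlbench row: 2684e9 instructions, CPI 0.42, 0.556 ns cycle time.
perlbench_time = execution_time_s(2684, 0.42, 0.556)

spec_ratios = [ref / exe for _, exe, ref in RESULTS]
overall = geometric_mean(spec_ratios)
print(f"perlbench: {perlbench_time:.0f} s, geometric mean: {overall:.2f}")
```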

### Pitfall: Amdahl's Law

- Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$

- Example: multiply accounts for 80s/100s
  - How much improvement in multiply performance to get 5× overall?
  - A 5× overall speedup requires $T_{\text{improved}} = 100\text{s}/5 = 20\text{s}$, so

$$20 = \frac{80}{n} + 20$$

- Can't be done! No finite $n$ makes $80/n = 0$
- Corollary: make the common case fast
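
A short sketch makes the limit visible: however large the improvement factor for multiply, the overall speedup stays below 5× because the unaffected 20s remains. The function names are illustrative.

```python
def improved_time(t_affected: float, t_unaffected: float,
                  improvement_factor: float) -> float:
    """Amdahl's Law: T_improved = T_affected / factor + T_unaffected."""
    return t_affected / improvement_factor + t_unaffected


def overall_speedup(t_affected: float, t_unaffected: float,
                    improvement_factor: float) -> float:
    """Original total time divided by improved total time."""
    original = t_affected + t_unaffected
    return original / improved_time(t_affected, t_unaffected,
                                    improvement_factor)


# Multiply takes 80 s of a 100 s run; try ever larger improvement factors.
for n in (2, 4, 10, 100, 1_000_000):
    print(f"multiply {n}x faster -> overall "
          f"{overall_speedup(80.0, 20.0, n):.3f}x")
```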



### Pitfall: MIPS as a Performance Metric

- MIPS: Millions of Instructions Per Second
  - Doesn't account for
    - Differences in ISAs between computers
    - Differences in complexity between instructions

$$\begin{aligned} \text{MIPS} &= \frac{\text{Instruction count}}{\text{Execution time} \times 10^6} \\ &= \frac{\text{Instruction count}}{\frac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\text{Clock rate}}{\text{CPI} \times 10^6} \end{aligned}$$

- CPI varies between programs on a given CPU
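
The final form of the equation shows why MIPS is not a fixed property of a processor: it changes whenever CPI changes. In the sketch below, the 4 GHz clock and the two CPI values are hypothetical numbers chosen for illustration.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


# Same hypothetical 4 GHz CPU, two programs with different average CPI.
mips_program_1 = mips(4e9, cpi=1.0)
mips_program_2 = mips(4e9, cpi=2.0)
print(f"Program 1: {mips_program_1:.0f} MIPS, "
      f"Program 2: {mips_program_2:.0f} MIPS")
```

The same CPU reports two different MIPS ratings here, so a single MIPS figure says little without knowing the program that was run.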

## **Concluding Remarks**

- Cost/performance is improving
  - Due to underlying technology development
- Hierarchical layers of abstraction
  - In both hardware and software
- Instruction set architecture
  - The hardware/software interface
- Execution time: the best performance measure
- Power is a limiting factor
  - Use parallelism to improve performance

