# Modeling and Analyzing CPU Power and Performance: Metrics, Methods, and Abstractions

Pradip Bose, Margaret Martonosi, David Brooks









#### Hitting the wall...

- Battery technology
  - Linear improvements, nowhere near the exponential power increases we've seen
- Cooling techniques
  - Air-cooled is reaching limits
  - Fans often undesirable (noise, weight, expense)
  - \$1 per chip per Watt when operating in the >40W realm
  - Water-cooled ?!?
- Environment
  - US EPA: 10% of current electricity usage in US is directly due to desktop computers
  - Increasing fast. And doesn't count embedded systems, Printers, UPS backup?

#### Past:

- Power important for laptops, cell phones

#### Present:

- Power a Critical, Universal design constraint even for very high-end chips
- Circuits and process scaling can no longer solve all power problems.
  - SYSTEMS must also be power-aware
  - Architecture, OS, compilers





#### Outline

- Power basics
- Power and energy metrics
- Modeling abstractions
- Measuring power in real machines
- Validation
- Trends and conclusions

#### Power: The Basics

- Dynamic power vs. Static power vs. short-circuit power
  - "switching" power
  - "leakage" power
  - Dynamic power dominates, but static power increasing in importance
  - Trends in each
- Static power: steady, per-cycle energy cost
- Dynamic power: power dissipation due to capacitance charging at transitions from 0->1 and 1->0
- Short-circuit power: power due to brief short-circuit current during transitions.
- Mostly focus on dynamic, but recent work on others
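
As a rough illustration of the dynamic (switching) term, the standard first-order relation is power = activity factor × C × V² × f. The sketch below only evaluates that relation; the function name and the example chip parameters are hypothetical and chosen for illustration.

```python
def dynamic_power_watts(activity_factor, capacitance_farads, vdd_volts, freq_hz):
    """First-order switching power: a * C * V^2 * f."""
    return activity_factor * capacitance_farads * vdd_volts ** 2 * freq_hz


# Hypothetical chip: 20 nF of switched capacitance, 1.8 V supply,
# 600 MHz clock, 30% of the capacitance toggling per cycle.
power = dynamic_power_watts(0.3, 20e-9, 1.8, 600e6)
print(f"dynamic power: {power:.2f} W")

# Halving the supply voltage alone cuts switching power by 4x.
print(dynamic_power_watts(0.3, 20e-9, 0.9, 600e6) / power)
```

The quadratic dependence on supply voltage is why voltage scaling features so heavily in the metrics discussion.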



#### **Short-Circuit Power Dissipation**



- Short-Circuit Current caused by finite-slope input signals
- Direct Current Path between VDD and GND when both NMOS and PMOS transistors are conducting

## Leakage Power



- Subthreshold currents grow exponentially with increases in temperature and decreases in threshold voltage
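
A minimal sketch of that exponential dependence, assuming the textbook first-order form I ∝ exp(-Vt / (n·kT/q)). The slope factor `n = 1.5` and the function name are assumptions, and real devices add prefactors and a temperature-dependent threshold voltage that this ignores.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23
ELECTRON_CHARGE_C = 1.602176634e-19


def relative_subthreshold_leakage(vt_volts, temp_kelvin, n=1.5):
    """Unitless leakage factor exp(-Vt / (n * kT/q)); only ratios are meaningful."""
    thermal_voltage = BOLTZMANN_J_PER_K * temp_kelvin / ELECTRON_CHARGE_C
    return math.exp(-vt_volts / (n * thermal_voltage))


base = relative_subthreshold_leakage(0.40, 300.0)
# Lowering Vt by 100 mV at the same temperature
print(relative_subthreshold_leakage(0.30, 300.0) / base)
# Raising temperature by 60 K at the same Vt
print(relative_subthreshold_leakage(0.40, 360.0) / base)
```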

**Power and Energy Metrics** 

#### Metrics Overview: A Microarchitect's View

- Performance metrics:
  - delay (execution time) per instruction; MIPS
    - CPI (cycles per instr): abstracts out the MHz
    - SPEC (int or fp); TPM: factors in benchmark, MHz
- energy and power metrics:
  - joules (J) and watts (W)
- joint metric possibilities (perf and power)
  - watts (W): for ultra LP processors; also, thermal issues
  - MIPS/W or SPEC/W ~ energy per instruction
    - CPI \* W: equivalent inverse metric
  - MIPS<sup>2</sup>/W or SPEC<sup>2</sup>/W ~ energy\*delay (EDP)
  - MIPS<sup>3</sup>/W or SPEC<sup>3</sup>/W ~ energy\*(delay)<sup>2</sup> (ED<sup>2</sup>P)
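
The joint metrics above can be computed directly from a throughput and a power figure. This is a small sketch with hypothetical function names and example numbers; it also shows why MIPS/W is an inverse energy-per-instruction measure.

```python
def efficiency_metrics(mips, watts):
    """Return the three joint perf/power metrics for one processor."""
    return {
        "MIPS/W": mips / watts,           # ~ 1 / energy per instruction
        "MIPS^2/W": mips ** 2 / watts,    # ~ 1 / (energy * delay)
        "MIPS^3/W": mips ** 3 / watts,    # ~ 1 / (energy * delay^2)
    }


def energy_per_instruction_nj(mips, watts):
    """Joules per second divided by instructions per second, in nanojoules."""
    return watts / (mips * 1e6) * 1e9


# Hypothetical processor: 1000 MIPS at 10 W
print(efficiency_metrics(1000.0, 10.0))
print(energy_per_instruction_nj(1000.0, 10.0), "nJ per instruction")
```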

#### Energy vs. Power

- Energy metrics (like SPEC/W):
  - compare battery life expectations; given workload
  - compare energy efficiencies: processors that hold voltage constant and use frequency or capacitance scaling to reduce power
- Power metrics (like W):
  - max power => package design, cost, reliability
  - average power => avg electric bill, battery life
- ED<sup>2</sup>P metrics (like SPEC<sup>3</sup>/W or CPI<sup>3</sup> \* W):
  - compare pwr-perf efficiencies: processors that use voltage scaling as the primary method of power reduction/control

#### E vs. EDP vs. ED<sup>2</sup>P

- Power ~ C.V<sup>2</sup>.f
  - ~ f (fixed voltage, design)
  - ~ C (fixed voltage, freq)
- Perf ~ f (fixed voltage and design)
  - ~ IPC (fixed voltage, freq)

So, across processors that use either frequency scaling or capacitance scaling, e.g. via clock gating or adaptive microarch techniques, multiple clocks, etc., MIPS/W or SPEC/W is the right metric to compare energy efficiencies. (Also, CPI \* W)

#### E vs. EDP vs. ED<sup>2</sup>P

- Power ~ C.V<sup>2</sup>.f ~ V<sup>3</sup> (fixed microarch/design)
- Performance ~ f ~ V (fixed microarch/design)
  - (For the 1-3 volt range, f varies approx. linearly with V)
- So, across processors that use voltage scaling as the primary method of power control (e.g. Transmeta), (perf)<sup>3</sup> / power, or MIPS<sup>3</sup>/W or SPEC<sup>3</sup>/W, is a fair metric to compare energy efficiencies.
- This is an ED<sup>2</sup>P metric. We could also use: (CPI)<sup>3</sup> \* W for a given application
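
To see why the cubed metric is the fair one under voltage scaling, the sketch below applies the slide's first-order assumptions (performance ~ V, power ~ V cubed) to a hypothetical design point. `scale_voltage` is an illustrative helper, not part of any tool.

```python
def scale_voltage(mips, watts, v_ratio):
    """Scale a design point by v_ratio = V_new / V_old.

    Assumes f varies linearly with V, so perf ~ V and power ~ C*V^2*f ~ V^3.
    """
    return mips * v_ratio, watts * v_ratio ** 3


base_mips, base_watts = 1000.0, 10.0          # hypothetical design point
low_mips, low_watts = scale_voltage(base_mips, base_watts, 0.8)

# MIPS/W changes with the voltage setting, so it rewards simply running slower.
print(base_mips / base_watts, low_mips / low_watts)
# MIPS^3/W is unchanged, so it compares the designs rather than the voltage knob.
print(base_mips ** 3 / base_watts, low_mips ** 3 / low_watts)
```

Under these assumptions MIPS/W improves by 1/0.8² when voltage drops to 80%, while MIPS<sup>3</sup>/W stays the same for the same underlying design.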

#### E vs. EDP vs. ED<sup>2</sup>P

- EDP metrics like MIPS<sup>2</sup>/W or SPEC<sup>2</sup>/W cannot be applied across an arbitrary set of processors to yield fair comparisons of efficiency; although, EDP could still be a meaningful optimization vehicle for a given processor or family of processors.
- Our view: use either E or ED<sup>2</sup>P type metrics, depending on the class of processors being compared (i.e. fixed voltage, variable cap/freq - E metrics; and, variable voltage/freq designs - ED<sup>2</sup>P metrics)
  - caveat: leakage power control techniques in future processors, that use lots of low-Vt transistors may require some rethinking of metrics



## Modeling & Abstractions

# What can architects & systems people do to help?



- Micro-Architecture & Architecture
  - Shrink structures
  - Shorten wires
  - Reduce activity factors
  - Improve instruction-level control
- Compilers
  - Reduce wasted work: "standard" operations
  - More aggressive register allocation and cache optimization
  - Trade off parallelism against clock frequency
- Operating Systems
  - Natural, since OS is traditional resource manager
  - Equal energy scheduling
  - Battery-aware or thermally-aware adaptation

# What do architects & systems people need to have, in order to help?



- Better observability and control of power characteristics
  - Ability to see current power, thermal status
    - Temperature sensors on-chip
    - Battery meters
  - Ability to control power dissipation
    - Turn units on/off
    - Techniques to impact leakage
  - Abstractions for efficient modeling/estimation of power consumption



# Power/Performance abstractions at different levels of this hierarchy...

- Low-level:
  - Hspice
  - PowerMill
- Medium-Level:
  - RTL Models
- Architecture-level:
  - PennState SimplePower
  - Intel Tempest
  - Princeton Wattch
  - IBM PowerTimer

#### Low-level models: Hspice

- Extracted netlists from circuit/layout descriptions
  - Diffusion, gate, and wiring capacitance is modeled
- Analog simulation performed
  - Detailed device models used
  - Large systems of equations are solved
  - Can estimate dynamic and leakage power dissipation within a few percent
  - Slow, only practical for 10-100K transistors
- PowerMill (Synopsys) is similar but about 10x faster

#### Medium-level models: RTL

- Logic simulation obtains switching events for every signal
- Structural VHDL or verilog with zero or unit-delay timing models
- Capacitance estimates performed
  - Device Capacitance
    - Gate sizing estimates performed, similar to synthesis
  - Wiring Capacitance
    - Wire load estimates performed, similar to placement and routing
- Switching event and capacitance estimates provide dynamic power estimates
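
The last step can be sketched as follows: count transitions per signal from a logic-simulation trace, then weight each by the node's estimated capacitance. The signal names, capacitances, and the 0.5·C·V² per-transition energy (the usual first-order average over charge and discharge) are illustrative assumptions.

```python
def count_transitions(trace):
    """Number of 0->1 and 1->0 events in a per-cycle list of signal values."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)


def dynamic_energy_joules(traces, capacitances_farads, vdd_volts):
    """Sum 0.5 * C * V^2 over every transition of every signal."""
    return sum(
        0.5 * capacitances_farads[name] * vdd_volts ** 2 * count_transitions(trace)
        for name, trace in traces.items()
    )


# Hypothetical two-signal trace over five cycles
traces = {"clk_en": [0, 1, 1, 0, 1], "data0": [0, 0, 1, 1, 1]}
caps = {"clk_en": 50e-15, "data0": 20e-15}   # device + wire estimates, in farads
energy = dynamic_energy_joules(traces, caps, vdd_volts=1.8)
print(f"{energy * 1e15:.1f} fJ over the trace")
```

Dividing the energy by the simulated time gives average dynamic power for the traced interval.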

#### Architecture level models

- Examples:
  - SimplePower, Tempest, Wattch, PowerTimer...
- Components of a "good" Arch. Level power model
  - Capacitance model
  - Circuit design styles
  - Clock gating styles & Unit usage statistics
  - Signal transition statistics

## **Modeling Capacitance**

- Requires modeling wire length and estimating transistor sizes
- Related to RC Delay analysis for speed along critical path
  - But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one.



#### Register File Model: Validation

| Error Rates | Gate   | Diff   | InterConn. | Total  |
|-------------|--------|--------|------------|--------|
| Wordline(r) | 1.11   | 0.79   | 15.06      | 8.02   |
| Wordline(w) | -6.37  | 0.79   | -10.68     | -7.99  |
| Bitline(r)  | 2.82   | -10.58 | -19.59     | -10.91 |
| Bitline(w)  | -10.96 | -10.60 | 7.98       | -5.96  |

#### (Numbers in Percent)

- Validated against a register file schematic used in Intel's Merced design
- Compared capacitance values with estimates from a layout-level Intel tool
- Interconnect capacitance had largest errors
  - Model currently neglects poly connections
  - Differences in wire lengths -- difficult to tell wire distances of schematic nodes

# Accounting for Different Circuit Design Styles

- RTL and Architectural level power estimation requires the tool/user to perform circuit design style assumptions
  - Static vs. Dynamic logic
  - Single vs. Double-ended bitlines in register files/caches
  - Sense Amp designs
  - Transistor and buffer sizings
- Generic solutions are difficult because many styles are popular
- Within individual companies, circuit design styles may be fixed

#### Clock Gating: What, why, when?

[Figure: clock vs. gated clock]

- Dynamic Power is dissipated on clock transitions
- Gating off clock lines when they are unneeded reduces activity factor
- But putting extra gate delays into clock lines increases clock skew
- End results:
  - Clock gating complicates design analysis but saves power. Used in cases where power is crucial.

## Signal Transition Statistics

- Dynamic power is proportional to switching
- How to collect signal transition statistics in architectural-level simulation?
  - Many signals are available, but do we want to use all of them?
  - One solution (register file):
    - Collect statistics on the important ones (bitlines)
    - Infer where possible (wordlines)
    - Assign probabilities for less important ones (decoders)
  - Use Controllability and Observability notions from testing community?

#### Power Modeling at Architecture Level

- Previous academic research has either:
  - Calculated power within individual units: i.e., cache
  - Calculated abstract metrics instead of power
    - e.g., "needless speculative work saved per pipe stage"
- What is needed now?
  - A single common power metric for comparing different techniques
  - Reasonable accuracy
  - Flexible/modular enough to explore a design space
  - Fast enough to simulate real benchmarks
  - Facilitate early experiments: before HDL or circuits...

#### **SimplePower**

- Vijaykrishnan, et al. ISCA 2000
- Models datapath energy in 5-stage pipelined RISC datapath
- Table-lookup based power models for memory and functional units
- Transition sensitive: table lookups are done based on input bits and output bits for unit being considered
- Change size of units => supply a new lookup table

#### TEM<sup>2</sup>P<sup>2</sup>EST

- Thermal Enabled Multi-Model Power/Performance Estimator: Dhodapkar, Lim, Cai, and Daasch
- Empirical Mode
  - Used for synthesizable logic blocks
  - Used for Clock distribution/interconnection
- Analytical Mode
  - Used for regular structures
  - Allows time-delay model extensions
- Temperature Model
  - Simple model links power to temperature

#### Wattch: An Overview



#### Wattch's Design Goals

- Flexibility
- Planning-stage info
- Speed
- Modularity
- Reasonable accuracy

#### Overview of Features

- Parameterized models for different CPU units
  - Can vary size or design style as needed
- Abstract signal transition models for speed
  - Can select different conditional clocking and input transition models as needed
- Based on SimpleScalar
- Modular: Can add new models for new units studied

## Modeling Units at Architectural Level



#### **Modeling Capacitance**

- Models depend on structure, bitwidth, design style, etc.
- E.g., may model capacitance of a register file with bitwidth & number of ports as input parameters

#### **Modeling Activity Factor**

- Use cycle-level simulator to determine number and type of accesses
  - reads, writes, how many ports
- Abstract model of bitline activity

## One Cycle in Wattch

|                              | Fetch                                                  | Dispatch                                                              | Issue/Execute                                                                                 | Writeback/<br>Commit                                        |
|------------------------------|--------------------------------------------------------|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------|-------------------------------------------------------------|
| Power<br>(Units<br>Accessed) | <ul><li>I-cache</li><li>Bpred</li></ul>                | <ul><li>Rename Table</li><li>Inst. Window</li><li>Reg. File</li></ul> | <ul><li>Inst. Window</li><li>Reg File</li><li>ALU</li><li>D-Cache</li><li>Load/St Q</li></ul> | <ul><li>Result Bus</li><li>Reg File</li><li>Bpred</li></ul> |
| Performance                  | <ul><li>Cache Hit?</li><li>Bpred<br/>Lookup?</li></ul> | <ul><li>Inst. Window Full?</li></ul>                                  | <ul><li>Dependencies<br/>Satisfied?</li><li>Resources?</li></ul>                              | <ul><li>Commit Bandwidth?</li></ul>                         |

- On each cycle:
  - determine which units are accessed
  - model execution time issues
  - model per-unit energy/power based on which units used and how many ports.
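
The per-cycle accounting can be sketched roughly as below. This is not Wattch's code: the unit list, per-access energies, port counts, and the rule that an unused unit still dissipates 10% of its maximum are all assumptions chosen to illustrate one possible conditional-clocking style.

```python
# Hypothetical maximum energy per cycle (nJ) and port counts per unit
UNIT_MAX_ENERGY_NJ = {"icache": 1.0, "regfile": 0.6, "alu": 0.8}
UNIT_PORTS = {"icache": 1, "regfile": 3, "alu": 2}


def cycle_energy_nj(ports_used, idle_fraction=0.1):
    """Energy for one cycle given {unit: number of ports accessed}."""
    total = 0.0
    for unit, max_energy in UNIT_MAX_ENERGY_NJ.items():
        used = min(ports_used.get(unit, 0), UNIT_PORTS[unit])
        if used == 0:
            # Gated-off unit: assume a small residual fraction of max energy
            total += idle_fraction * max_energy
        else:
            # Active unit: scale linearly with the number of ports used
            total += max_energy * used / UNIT_PORTS[unit]
    return total


# One simulated cycle: a fetch plus two register-file reads, ALU idle
print(cycle_energy_nj({"icache": 1, "regfile": 2}))
```

Summing this value over all simulated cycles and dividing by simulated time yields average power; the performance simulator supplies `ports_used` each cycle.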



# Units Modeled by Wattch

- Array Structures
  - Caches, Reg Files, Map/Bpred tables
- Content-Addressable Memories (CAMs)
  - TLBs, Issue Queue, Reorder Buffer
- Complex combinational blocks
  - ALUs, Dependency Check
- Clocking network
  - Global Clock Drivers, Local Buffers



#### Wattch Simulation Speed

- Roughly 80K instructions per second (PII-450 host)
- ~30% overhead compared to performance simulation alone
  - Could be decreased if power estimates are not computed every cycle
- Many orders of magnitude faster than lower-level approaches
  - For example, PowerMill takes ~1 hour to simulate 100 test vectors on a 64-bit adder

#### Wattch: Summary

- A preliminary but useful step towards providing modular, flexible architecture-level models with reasonable accuracy
- Future Work:
  - User selectable circuit styles (high-performance, low-power, etc)
  - Update models as technologies change







# Comparing Arch. Level power models: Flexibility

- Flexibility necessary for certain studies
  - Resource tradeoff analysis
  - Modeling different architectures
- Wattch provides fully-parameterizable power models
  - Within this methodology, circuit design styles could also be studied
- PowerTimer scales power models in a user-defined manner for individual sub-units
  - Constrained to structures and circuit-styles currently in the library
- SimplePower provides parameterizable cache structures

# Comparing Arch. Level power models: Speed

- Performance simulation is slow enough!
- Wattch's per-cycle power estimates: roughly 30% overhead
  - Post-processing (per-program power estimates) would be much faster (minimal overhead)
- PowerTimer has no overhead (currently all postprocessed based on already existing stats)
- SimplePower may have more performance overhead because of table-lookups

# Comparing Arch. Level power models: Accuracy

- Wattch provides excellent relative accuracy
  - Underestimates full chip power (some units not modeled, etc)
- PowerTimer models based on circuit-level power analysis
  - Inaccuracy is introduced in SF/AF and scaling assumptions
- SimplePower should provide highest accuracy
  - Static core (only caches are parameterized)
  - Table lookups track all transitions

Measuring power in real machines...

## Measuring power (vs. modeling it)

- First part of talk discussed power modeling.
- What about power measurement?
- Challenges:
  - Difficult to get enough motherboard information to measure the power you want to.
  - Even harder (i.e., impossible) to break down on-chip power into a pie chart of different contributors
  - Difficult to ascribe power peaks and valleys to particular software behavior or program constructs.





#### Limitations to meter-based Approaches

- Can only measure what actually exists
- Difficult to ascribe power to particular parts of the code
- Difficult to get very fine-grained readings due to power-supply capacitors etc.
- Difficult to "pie chart" power into which units are dissipating it

## Monitoring power on existing CPUs: Counter-Based



- Say you wish to measure power consumption for a program running on an existing CPU?
  - Surprisingly difficult to do
  - Ammeter on power cord is difficult to synchronize with application runtimes
- Say you want to produce a pie chart of measured power dissipation per-unit for this program running on an existing CPU?
  - Nearly impossible to do directly

## CASTLE: Measuring Power Data from Event Counts

#### Basic idea:

- Most (all?) high-end CPUs have a bank of hardware performance counters
- Performance counters track a variety of program events
  - Cache misses
  - Branch mispredicts...
- If these events are also the main power dissipators, then we can hope these counters can also help power measurement
- Estimate power by weighting and superimposing event counts appropriately

#### **CASTLE: Details & Methodology**

- Gather M counts for N training applications
- Compute weights using least-squares error
- Use these weights (W<sub>1</sub>-W<sub>M</sub>) to estimate power on other apps
- Consider accuracy of power estimates compared to other power measurements
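
The weight-fitting step can be sketched as an ordinary least-squares solve. The version below uses the normal equations with a small hand-written Gaussian elimination so it has no dependencies; the training data and function names are invented for illustration and are not CASTLE's actual implementation.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_weights(counts, powers):
    """Least-squares weights: solve (X^T X) w = X^T y.

    counts: N rows (training apps), each with M event rates.
    powers: N measured power values, one per training app.
    """
    m = len(counts[0])
    xtx = [[sum(row[i] * row[j] for row in counts) for j in range(m)] for i in range(m)]
    xty = [sum(row[i] * p for row, p in zip(counts, powers)) for i in range(m)]
    return solve_linear_system(xtx, xty)


def estimate_power(weights, event_rates):
    """Weighted superposition of event rates for a new application."""
    return sum(w * c for w, c in zip(weights, event_rates))


# Invented training set: M = 2 event rates for N = 4 applications
training_counts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
training_power = [2.0, 0.5, 2.5, 4.5]
weights = fit_weights(training_counts, training_power)
print(weights, estimate_power(weights, [3.0, 2.0]))
```

The fit needs at least as many training applications as events (N ≥ M) and event counts that are not linearly dependent; otherwise the normal equations are singular.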





## **Assumptions and Caveats**

- Leakage power is not handled currently
- Only single-ended bitline power is value dependent



#### **Performance Counters**

Are debugging aids found in most high performance processors

| Processor    | Available Events | Simultaneous Events |
|--------------|------------------|---------------------|
| PowerPC 604  | 44               | 2                   |
| PowerPC 603e | 116              | 4                   |
| POWER3       | 238              | 8                   |
| Pentium Pro  | 77               | 2                   |

- Track performance-relevant events, including:
  - cache miss
  - instruction retirement
  - branch misprediction
- Give insight into utilization (u<sub>1</sub>, u<sub>2</sub>,...,u<sub>n</sub>)
- Are not useful in calculating bitline values (v<sub>1</sub>,v<sub>2</sub>,...,v<sub>k</sub>)

#### **Estimation Problems and Solutions**

Problem One: Abundance of performance relevant counts. Lack of power relevant counts.

Solution: Approximate missing counts with available ones.

Problem Two: Insufficient counters to observe all resources.

Solution: Sample events and time-divide counters.

Problem Three: Cannot directly monitor bitline signals.

Solution: Assume register file typifies bitline data. Sample it.
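
Time-dividing a small number of counters across many events (Problem Two) can be sketched as a round-robin schedule plus extrapolation. The event names and counts here are hypothetical; the scheme assumes each event's rate is roughly steady across time quanta.

```python
def round_robin_schedule(events, num_counters, num_quanta):
    """Which events each hardware counter tracks in each time quantum."""
    schedule = []
    for quantum in range(num_quanta):
        start = (quantum * num_counters) % len(events)
        schedule.append([events[(start + i) % len(events)] for i in range(num_counters)])
    return schedule


def extrapolate_counts(samples, total_quanta):
    """Scale the mean observed count per quantum up to the whole run.

    samples: {event: [count in each quantum where it was monitored]}
    """
    return {
        event: sum(observed) / len(observed) * total_quanta
        for event, observed in samples.items()
    }


events = ["instr_decoded", "instr_executed", "l1d_access", "l2_access"]
print(round_robin_schedule(events, num_counters=2, num_quanta=4))
print(extrapolate_counts({"l1d_access": [100, 120]}, total_quanta=4))
```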

#### **Validations**

- Used Wattch [Brooks-ISCA 2000], an architectural-level simulator, to model a four-issue 660 MHz Alpha processor.
- Relied on Wattch models to compute maximum component and bitline power.
- Assumed a 10 millisecond OS time quantum for sampling.
- Assumed 2 hardware counters and used a total of 10 events: instructions decoded, instructions executed, instructions retired, floating point operations executed, branches retired, branches decoded, L1 instruction cache accesses, L1 data cache accesses, L2 unified cache accesses, main memory requests

#### Problem One - Counter Relevance

Since counter events don't directly observe all power relevant events, we apply platform specific heuristics.

#### Example 1

register rename lookups = 2 x instructions decoded

#### Example 2

integer alu accesses = instructions executed - floating point operations

#### Example 3

physical register accesses = 5 x instructions executed
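
The three example heuristics can be written as a small helper that derives power-relevant counts from the available counters. The function and key names are illustrative; the multipliers are the platform-specific ones quoted in the examples above.

```python
def derive_power_events(instr_decoded, instr_executed, fp_ops):
    """Approximate power-relevant event counts from observable counters."""
    return {
        "register_rename_lookups": 2 * instr_decoded,
        "integer_alu_accesses": instr_executed - fp_ops,
        "physical_register_accesses": 5 * instr_executed,
    }


# Hypothetical counter readings for one sampling interval
print(derive_power_events(instr_decoded=1200, instr_executed=1000, fp_ops=150))
```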







## Implementation on Pentium Pro



System Configuration: Intel Pentium Pro 200MHz, 128 MB RAM, 2 GB Internal SCSI HD, Linux 2.2.16

**Estimation Techniques** 

- Utilization Approximations - Fully Implemented
- Data Value Estimates - Assume constants

Validation: HP 34401A Multimeter with automatic data logging

#### Pentium Pro Power Model





## Other Measurement Techniques

- Thermal sensors
  - [Sanchez et al. 1995]
  - PowerPC includes thermal sensor and allows for realtime responses to thermal emergencies.
    - Eg. Reduce instruction fetch rate

# Comparing different measurement/modeling techniques

- Choice of technique depends on experiment to be done
- Measuring different software on unchanging platform
  - Real platform probably better
- Measuring impact of hardware design changes
  - Need simulations, since real hardware doesn't exist...

#### Validation







#### **Bounding Perf and Power**

- Lower and upper bounds on expected model outputs can serve as a viable "spec", in lieu of an exact reference
- Even a single set of bounds (upper or lower) is useful

| Test Case Number | Performance: Cpi (ub) | Performance: Cpi (lb) | Performance: T (ub) | Performance: T (lb) | Utilization/Power: upper bound | Utilization/Power: lower bound |
|------------------|-----------------------|-----------------------|---------------------|---------------------|--------------------------------|--------------------------------|
| TC.1             |                       |                       |                     |                     |                                |                                |
| TC.2             |                       |                       |                     |                     |                                |                                |
| ...              |                       |                       |                     |                     |                                |                                |
| TC.n             |                       |                       |                     |                     |                                |                                |



```
Static Bounds - Example

 fadd  fp3, fp1, fp0        Consider an in-order-issue
 lfdu  fp5, 8(r1)           super scalar machine:
 lfdu  fp4, 8(r3)           • disp_bw = iss_bw = compl_bw = 4
 fadd  fp4, fp5, fp4        • fetch_bw = 8
 fadd  fp1, fp4, fp3        • l_ports = ls_units = 2
 stfdu fp1, 8(r2)           • s_ports = 1
 bc    loop_top             • fp_units = 2

 N = number of instructions/iteration = 7

 Steady-state loop cpi performance is determined by the
 narrowest (smallest bandwidth) pipe
   - above example: CPIter = 2; cpi = 2/7 = 0.286
```
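
The static bound in this example can be reproduced with a short calculation: each resource needs at least (operations per iteration / bandwidth) cycles, and the slowest resource sets the loop time. The sketch assumes each iteration occupies a whole number of cycles (hence the ceiling), which matches the CPIter = 2 result above; the function name is hypothetical.

```python
import math


def cycles_per_iteration(n_instr, n_loads, n_stores, n_fp,
                         disp_bw, l_ports, s_ports, fp_units):
    """Lower bound on cycles per loop iteration from the narrowest pipe."""
    return max(
        math.ceil(n_instr / disp_bw),    # dispatch/issue/completion bandwidth
        math.ceil(n_loads / l_ports),    # load ports
        math.ceil(n_stores / s_ports),   # store ports
        math.ceil(n_fp / fp_units),      # floating-point units
    )


# Loop above: 7 instructions = 3 fadd, 2 lfdu, 1 stfdu, 1 bc
cp_iter = cycles_per_iteration(n_instr=7, n_loads=2, n_stores=1, n_fp=3,
                               disp_bw=4, l_ports=2, s_ports=1, fp_units=2)
print(cp_iter, round(cp_iter / 7, 3))
```

A simulator that reports a steady-state cpi below this value for the loop is violating a resource constraint, which is what makes the bound usable as a validation spec.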









# Abstraction via Separable Components

The issue of absolute vs. relative accuracy is raised in any modeling scenario: be it "performance-only", "power" or "power-performance." Consider a commonly used performance modeling abstraction:

CPI = CPI<sub>infcache</sub> + MR \* MP

[Figure: CPI plotted against cache miss rate, MR (misses/instr). Slope = miss penalty (MP). Annotation: increasing core concurrency and overlap (e.g. outstanding miss support).]
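
A minimal sketch of the separable-components abstraction: an infinite-cache CPI plus a memory-stall term that grows linearly with miss rate. The numbers are hypothetical, and the "effective" penalty used to mimic miss overlap is an assumption for illustration only.

```python
def cpi_separable(cpi_infcache, miss_rate, miss_penalty):
    """CPI = CPI_infcache + MR * MP, with MR in misses per instruction."""
    return cpi_infcache + miss_rate * miss_penalty


# Hypothetical core: infinite-cache CPI of 1.0, 50-cycle miss penalty
for miss_rate in (0.00, 0.01, 0.02):
    print(miss_rate, cpi_separable(1.0, miss_rate, 50.0))

# If overlap hides part of each miss, the effective slope is smaller than MP,
# so the simple model's absolute error grows with miss rate.
print(cpi_separable(1.0, 0.02, 50.0) - cpi_separable(1.0, 0.02, 35.0))
```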









#### **Accuracy Conclusions**

- Separable components model (for performance, and *probably* for related power-performance estimates):
  - good for relative accuracy in most scenarios
  - absolute accuracy depends on workload characteristics
- Detailed experiments and analysis in:

#### Brooks, Martonosi and Bose (2001):

"Abstraction via separable components: an empirical study of absolute and relative accuracy in processor performance modeling," IBM Research Report, Jan, 2001 (submitted for external publication)

- Power-performance model validation and accuracy analysis:
  - work in progress

Trends and conclusions...

## Leakage Power: Models and Trends

- Currently: leakage power is roughly 2-5% of CPU chip's power dissipation
- Future: without further action, leakage power expected to increase exponentially in future chip generations
- The reason?
  - Supply Voltage ↓ to save power
  - => Threshold voltages ↓ to maintain performance
  - => Leakage current ↑

## Other technology trends and needs

- Need:
  - Good models for leakage current
  - Ways of handling chips with more than one Vt
  - Models that link power and thermal characteristics

#### Other resources

- Tutorial webpage
  - Web versions of demos
  - Access to slides
  - Semi-comprehensive Power Bibliography...