#### IEEE 2005 CUSTOM INTEGRATED CIRCUITS CONFERENCE

# A 90nm Power Optimization Methodology and Its Application to the ARM1136JF-S Microprocessor

A. Khan<sup>+</sup>, Senior Member, IEEE, P. Watson<sup>\*</sup>, G. Kuo<sup>+</sup>, D. Le<sup>+</sup>, T. Nguyen<sup>+</sup>, S. Yang<sup>+</sup>, P. Bennet<sup>+</sup>, P. Huang<sup>+</sup>, J. Gill<sup>+</sup>, D. Wang<sup>+</sup>, I. Ahmed<sup>+</sup>, P. Tran<sup>+</sup>, H. Mak<sup>+</sup>, O. Kim<sup>+</sup>, F. Martin<sup>+</sup>, Y. Fan<sup>+</sup>, D. Ge<sup>+</sup>, J. Kung<sup>+</sup>, V. Shek<sup>+</sup>

Abstract—An electrical and physical design power optimization methodology, and the design techniques developed to create an ARM1136JF-S microprocessor in 90nm standard CMOS, are presented. A 40% reduction in power dissipation has been achieved while maintaining a 355 MHz operating clock rate under typical conditions. Functional and electrical design requirements were achieved with the first silicon.

Index Terms—IC Design, Power Reduction, IC Design Methodology, IC Design Techniques, Electrical and Physical Design, Clock Design and Distribution, Power Distribution, Signal Integrity, Design Validation.

#### I. INTRODUCTION

System-on-chip ICs are increasingly critical to the functionality and performance of consumer-oriented products, such as digital cellular handsets, storage devices, video game consoles, consumer display devices, graphics cards, digital TVs, PC applications, broadband access devices and DVDs [1]. The continuing proliferation of portable consumer products places a growing emphasis on delivering the required performance while maximizing the battery-powered operating period. Hence, energy-efficient design is an active area of research and development in semiconductor electronics [2, 3]. Continued advances in semiconductor processing are increasing the challenges associated with energy-efficient design. For example, at the 90nm process node, the leakage current of low-V<sub>T</sub> transistors can be sixfold greater than that of high-V<sub>T</sub> transistors, so leakage power plays a significant role in the overall energy efficiency of an IC.

A variety of power management techniques have been developed and applied to date. These include the use of clock gating, frequency scaling and the selection of specially-developed process options or libraries. More recently, such techniques have included voltage scaling and performance optimization using multi-$V_T$ cell libraries.

In this paper, a design methodology developed and applied to reduce power dissipation in a 355 MHz ARM1136JF-S microprocessor is presented. The microprocessor has been realized in a standard 90nm CMOS process and was functional at first silicon. Highlighted design techniques and the results achieved are presented.

#### II. SYSTEM AND IC ARCHITECTURE

The developed IC, which includes an ARM1136JF-S microprocessor and related circuitry, was designed to function in an ARM-developed system reference design. System and IC architecture are presented.

#### II.1. System

A system to validate the performance of ARM microprocessors has been developed by ARM. It permits application-level software to be run on the IC and enables measurement of performance benchmarks (e.g., Dhrystones), thus enabling valid measurements of power dissipation. Functional capabilities of this system are noted [4].



Fig. 1a: ARM RealView® Validation System Board

Fig. 1b: Validation System Functionality

#### II.2. IC Architecture

The major building blocks in the IC are the microprocessor core, the ETM11 trace macro and the ETB11 trace buffer. A multi-level advanced high-performance bus (AHB) fabric also exists at the chip level, to connect the AHB Lite ports of the core to the full AHB interface, which is accessible from the package pins. The bus structure also provides access to the 128 KB on-chip test chip RAM to enable data transfers from any four ports concurrently.



Fig. 2: IC Functional Diagram

Two additional validation co-processors are integrated to drive the co-processor interface and to ensure correct functioning between core processing and co-processor wait states (e.g., instruction execution, wait state, data load and store, etc.). The support logic required for manufacturing test and debug is also integrated.

<sup>+</sup> Cadence Design Systems, Inc., 535 River Oaks Parkway, MS 4.1.020, San Jose, CA 95134, U.S.A.; akhan@cadence.com. <sup>\*</sup> ARM Ltd., 110 Fulbourn Road, Cambridge CB1 9NJ, U.K.; philip.watson@arm.com

The ARM1136JF-S is a cached microprocessor core with full virtual memory capabilities. In this IC, it is configured with a 16 KB cache for both data and instruction streams, made up of data, tag, valid and dirty (modified) RAMs. An associated 16 KB TCM configuration is also included, for both data and instruction streams. Additional Tag RAMs and TLBs support the complete processor architecture. In total, 44 memory instances are integrated. The core supports the ARM and Thumb instruction sets and a range of extended DSP instructions, along with Jazelle technology to enable direct execution of Java byte codes.

#### III. ELECTRICAL/PHYSICAL DESIGN METHODOLOGY

The ARM1136JF-S includes several micro-architectural features to reduce energy consumption, including efficient memory management and architectural clock gating. "Run" and "Standby" power management modes for the microprocessor are supported in this IC. Architectural level clock gating has also been inferred in the RTL code, to enable the software interface to control the gating to all but the interrupt logic.

Further, electrical and physical design techniques for leakage/speed optimization in a multiple supply voltage environment were developed or enhanced and applied. These include: RTL synthesis leakage/speed optimization, VDD selection, power distribution, clock gating, power/delay analysis of multiple operating points, and the electrical design, analysis and verification of the full IC.

#### III.1. RTL Synthesis for Multiple Supply Voltage Operation and Leakage/Speed Optimization

First, the code and memory configurations for the microprocessor were set. Then the RAM configuration was verified for correct functionality in the selected 90nm CMOS process. The microprocessor was then verified in a predeveloped simulation test bench environment comprising ~700 test cases (>135K vectors), which requires several days of run time, to exercise the full functionality of the processor architecture. The vector sets generated within this environment also enabled the generation of vectors indicative of power activity, in both VCD and TCF formats, as required for subsequent detailed power analysis. This fully-verified RTL was used throughout the design process as the golden reference for all steps, including functional verification and regression testing prior to tape-out.

The design methodology has been extended to support the use of multiple voltage domains in automated place-and-routed blocks (including the management of voltage interface cells) [5]. Two designs were implemented concurrently to enable valid comparisons of power consumption: the baseline design and the low power design.

Conventionally, multiple-pass synthesis or postprocessing scripts have been utilized to optimize timing, power and area sequentially. This approach adds complexity to the design flow and - since such synthesis requires sequential trade-offs between the equally-important design requirements of timing, power and area - does not ensure that the optimal balance between these requirements has been achieved.

In the present methodology, the complete RTL was synthesized with single-pass concurrent optimization for leakage power, timing and area, with the newly-developed global optimization synthesis technology, using dual-$V_T$ libraries (normal and high). This strategy was applied for the 1.0V and the 0.8V VDD domains. Timing-critical logic paths were mapped into the 1.0V domain and non-critical logic was mapped into the 0.8V domain. The automation of this multiple-constraint optimization simplified and accelerated the design's development. 62% and 38% of the standard cells were placed in the 0.8V and 1.0V domains, respectively.
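The domain mapping described above can be illustrated with a minimal sketch. It assumes each cell carries its own slack and a fixed delay penalty for running at 0.8V; both the data and the names (`Cell`, `assign_vdd`) are hypothetical. The actual synthesis technology optimizes leakage, timing and area concurrently on full timing paths, which this sketch does not attempt.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    name: str
    slack_ps: float      # timing slack at 1.0V (hypothetical value)
    penalty_ps: float    # assumed extra delay if the cell runs at 0.8V

def assign_vdd(cells, margin_ps=0.0):
    """Return {cell name: VDD}; a cell moves to 0.8V only if slack remains."""
    domains = {}
    for cell in cells:
        fits_low_vdd = cell.slack_ps - cell.penalty_ps >= margin_ps
        domains[cell.name] = 0.8 if fits_low_vdd else 1.0
    return domains

cells = [
    Cell("alu_add", slack_ps=5.0, penalty_ps=40.0),     # timing-critical
    Cell("dbg_mux", slack_ps=300.0, penalty_ps=40.0),   # non-critical
    Cell("cfg_reg", slack_ps=120.0, penalty_ps=60.0),   # non-critical
]
domains = assign_vdd(cells)
```

Here only `alu_add` stays in the 1.0V domain, because its slack cannot absorb the assumed low-voltage delay penalty.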

Several techniques were automatically applied during synthesis to minimize chip power consumption. These include buffer removal and logic resizing (before clock tree synthesis), and pin swapping, so that high-transition nets drive low-capacitance inputs (Figs. 3a, 3b).





Figure 3a: Buffer Removal and Logic Resizing

Figure 3b: Input High Transition Signals to Low Capacitance Pins
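The pin-swapping idea can be sketched as follows, assuming the gate's inputs are logically interchangeable and that switching power on each input scales with the product of net activity and pin capacitance. The activity and capacitance values and the function names are hypothetical; a production tool would also weigh the timing impact of each swap.

```python
def swap_pins(net_activity, pin_cap_ff):
    """Pair nets with equivalent pins so the busiest nets see the smallest loads."""
    nets = sorted(net_activity, key=net_activity.get, reverse=True)
    pins = sorted(pin_cap_ff, key=pin_cap_ff.get)
    return dict(zip(nets, pins))

def switched_cap(assignment, net_activity, pin_cap_ff):
    """Activity-weighted capacitance, proportional to switching power."""
    return sum(net_activity[n] * pin_cap_ff[p] for n, p in assignment.items())

net_activity = {"n_clk_en": 0.40, "n_mode": 0.02, "n_data": 0.15}  # toggle rates
pin_cap_ff = {"A": 2.0, "B": 1.2, "C": 1.6}                        # input caps (fF)

original = {"n_clk_en": "A", "n_mode": "B", "n_data": "C"}
swapped = swap_pins(net_activity, pin_cap_ff)

before = switched_cap(original, net_activity, pin_cap_ff)
after = switched_cap(swapped, net_activity, pin_cap_ff)
```

Pairing activities in descending order with capacitances in ascending order minimizes the sum of products, so `after` is never larger than `before` for this cost model.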

Slow transition nets were also buffered to minimize the duration within which both pFET and nFET transistors conduct current simultaneously. Logic topologies were also restructured to minimize high switching nets (Figs. 4a, 4b).





Figure 4a: Slew Rate Reduction

Figure 4b: Logic Restructuring

These automated methods reduce both leakage and dynamic power consumption. The resulting concurrently-optimized gate-level netlist contained domain modules for both VDDs.

#### III.2. VDD Selection and Power Distribution

The 0.8V and 1.0V VDD domains were selected based on analysis of the standard cells' timing performance and of the leakage, standby and dynamic power requirements. At the selected VDD levels, the performance-critical standard cells in the design exhibit comparable power-delay-product performance (Figs. 5a, 5b).
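For intuition on why the lower supply is attractive, the standard first-order model of CMOS switching power can be applied to the two supply levels. This is a textbook approximation rather than the analysis used in this work; $\alpha$ (switching activity), $C$ (switched capacitance) and $f$ (clock frequency) are assumed equal in both domains.

$$
P_{dyn} = \alpha \, C \, V_{DD}^{2} \, f
\qquad\Rightarrow\qquad
\frac{P_{dyn}(0.8\,\mathrm{V})}{P_{dyn}(1.0\,\mathrm{V})} = \left(\frac{0.8}{1.0}\right)^{2} = 0.64
$$

Under these assumptions, logic moved to the 0.8V domain dissipates about 36% less switching power, at the cost of longer cell delays.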

Signal nets and the associated routing topologies are allowed to traverse across different potential blocks. The interface signals between different voltage domains require voltage translation or isolation; this is accomplished by the use of voltage level shifting (VLS) cells. These cells must be designed and laid out to handle hot n-well spacing, not only within the VLS cell itself but also from VLS cells to other cells (including other VLS cells). The power ramp-up/ramp-down sequence typically requires greater than minimum n-well spacing. Several microns of n-well spacing may be adequate, depending on process and product requirements. When VLS cells are laid out within a row of standard cells, the newly-developed design technology can align all VLS cells to avoid vertical n-well spacing violations. In our case, signals from cells which are within the 0.8V power domain need to have VLS cells applied before exiting this domain to enter the 1.0V logic domain. The optimization design technology needs to recognize that buffers for 1.0V signals should not be inserted within the 0.8V domain.
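The rule that 0.8V signals must pass through a VLS cell before entering the 1.0V domain can be sketched on a toy netlist. The netlist representation, domain labels and the `vls_` naming are hypothetical, and only the 0.8V-to-1.0V direction stated in the text is handled.

```python
def insert_level_shifters(nets, cell_domain):
    """Split each 0.8V -> 1.0V connection with a level shifter cell.

    nets: list of (driver, [sinks]); cell_domain: {cell: "VDD_0V8" | "VDD_1V0"}.
    """
    new_nets, shifters = [], []
    for driver, sinks in nets:
        crossing = [s for s in sinks
                    if cell_domain[driver] == "VDD_0V8"
                    and cell_domain[s] == "VDD_1V0"]
        local = [s for s in sinks if s not in crossing]
        if crossing:
            vls = f"vls_{driver}"
            shifters.append(vls)
            local.append(vls)                  # driver now feeds the shifter
            new_nets.append((vls, crossing))   # shifter feeds the 1.0V sinks
        new_nets.append((driver, local))
    return new_nets, shifters

cell_domain = {"u1": "VDD_0V8", "u2": "VDD_0V8", "u3": "VDD_1V0", "u4": "VDD_1V0"}
nets = [("u1", ["u2", "u3"]), ("u3", ["u4"])]
new_nets, shifters = insert_level_shifters(nets, cell_domain)
```

One shifter is shared by all 1.0V sinks of a given 0.8V driver, which keeps the number of inserted cells low in this simplified model.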





Fig. 5a: Normalized Power Delay Product at 0.8V and 1.0V VDD

Fig. 5b: Normalized Cell Delays at Dual V<sub>T</sub>, VDD Levels vs. Loading

Conventionally, VLS cells have been manually inserted, placed and routed; this process is labor-intensive and error-prone. In the present methodology, the VLS cells were automatically inserted, placed and routed, thereby eliminating the conventional constraint, which limited the use of cross-domain signals. Further, the level of automation minimized the impact of floorplan changes that occur during the exploration and implementation phases.

The power distribution network has been generated automatically with new design technology. This technology further enabled power and place-and-route analysis and optimization across the different operating voltages. The main 1.0V core power has been distributed adjacent to the 0.8V block boundary. The need to provide multiple VDD levels to VLS cells requires that all level shifter placements are abutted continuously to the power domain boundary. Such domain-perimeter-driven level shifter placement may not be ideal for timing closure, but the design impact of this constraint can be addressed early in the design cycle, due to known placements.

The placement design technology needs to take multiple power domains into account when automatically inserting antenna diode cells, so as to avoid creating short-circuit connections across power domains. Using n-p diodes (connected to ground) can address this hazard effectively.

#### III.3. Clock Gating

The microprocessor core includes architectural-level clock gating. An automated design flow has been developed and applied to implement additional clock gating, inferred from the RTL through low-power synthesis. More than 1,000 clock gating cells have been identified and added to shut off dynamic power dissipation in quiescent logic, based on real-time application-level requirements. Once the initial clock gates were identified, clock decloning was applied by moving clock gating to the highest hierarchical node of the logic tree, to combine the clock gate cells amongst multiple logic modules (Figs. 7a, 7b). This approach moves the multiplexer to a higher location in the hierarchical clock network, thereby reducing power. It also reduced the total number of clock gating cells, the complexity of clock tree synthesis, and clock insertion delays.





Fig. 7a: Original Clock Distribution

Fig. 7b: Optimized Clock
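Clock decloning, as described above, can be sketched as merging clock gating cells that share the same clock and enable signal into a single gate placed higher in the hierarchy. The gate records and signal names below are hypothetical, and the sketch ignores the fanout and placement limits a real flow would respect.

```python
from collections import defaultdict

def declone(clock_gates):
    """Merge gates that share (clock, enable) into one gate driving all their flops."""
    merged = defaultdict(list)
    for gate in clock_gates:
        merged[(gate["clock"], gate["enable"])].extend(gate["flops"])
    return [{"clock": clk, "enable": en, "flops": flops}
            for (clk, en), flops in merged.items()]

gates = [
    {"clock": "clk", "enable": "en_fpu", "flops": ["fpu/a_reg"]},
    {"clock": "clk", "enable": "en_fpu", "flops": ["fpu/b_reg"]},
    {"clock": "clk", "enable": "en_dbg", "flops": ["dbg/st_reg"]},
]
merged = declone(gates)
```

In this example three gating cells collapse to two, since both `en_fpu` gates can be served by one cell.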

#### III.4. Timing (Electrical) Closure in a Multi-VDD Domain Design (Including ECOs)

The electrical performance of logic implemented in a multiple supply voltage (MSV) scheme is heavily dependent on VLS placement. If VLS cells are not placed in the physically-optimal location, the timing of routed nets is not optimal due to detoured routing, which creates an additional timing penalty (Figs. 8a, 8b).







Fig. 8b: Impact of Level Shifter Placement on Timing

Power-supply-aware timing and clock constraints are required to avoid such placements. In current design practice, only timing constraint information is typically provided to define the boundary constraints between two physical timing domains; cross-domain power constraints, which could guide optimal level shifter placements, are typically not available. Similarly, when engineering change orders (ECOs) are applied after placement and routing, the routing engine used to implement such ECOs needs to take power domains into account. The optimization of signal nets routed across different VDD domains typically requires modifications to the logical netlist to insert VLS cells, which also affects the routing topology. The post-layout RC loading data available to the optimization engine may not reflect such netlist changes, making automated timing repairs non-optimal. Consequently, an iterative approach has been applied for timing optimization with VLS insertions. To determine the physically-optimal VLS placement in the timing-driven place-and-route mode, timing-driven placement without VLS cells is completed first and then, based on these results, VLS cells are placed in the appropriate physical locations. This process has now been automated within the design technology.
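The second step of this two-pass approach can be sketched as choosing, for each cross-domain net, the legal boundary site that adds the least routed length once driver and sink positions are known from the first placement pass. The coordinates, the list of candidate sites and the Manhattan-distance cost are all assumptions made for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def best_vls_site(driver_xy, sink_xy, boundary_sites):
    """Pick the boundary site that minimizes driver -> VLS -> sink wire length."""
    def routed_length(site):
        return manhattan(driver_xy, site) + manhattan(site, sink_xy)
    return min(boundary_sites, key=routed_length)

# Positions taken from a first placement pass without VLS cells (hypothetical).
driver, sink = (10, 40), (90, 55)
sites = [(50, 0), (50, 50), (50, 100)]   # free sites on the domain boundary x = 50

site = best_vls_site(driver, sink, sites)
detour = manhattan(driver, site) + manhattan(site, sink) - manhattan(driver, sink)
```

A site lying between the driver and the sink adds no detour, whereas the other two candidates would lengthen the route considerably.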

To reduce power further, after place-and-route, normal-$V_T$ cells were selectively replaced with high-$V_T$ cells while maintaining the timing required, on a net-by-net basis. Such cell swapping is restricted to cells which occupy the same area footprint as the original cell.
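A minimal sketch of such a post-route swap is a greedy pass that converts cells to high $V_T$ while the path they sit on still has slack to absorb the added delay. It assumes each cell belongs to a single timing path, and all slack, penalty and leakage figures are hypothetical; a real flow would rerun timing analysis on the affected nets after each change.

```python
def swap_to_high_vt(cells, path_slack_ps):
    """Greedy leakage recovery: swap cells while their path slack stays non-negative."""
    slack = dict(path_slack_ps)          # work on a copy of the path slacks
    swapped = []
    # Try the largest leakage savings first.
    for c in sorted(cells, key=lambda c: c["leak_saving_nw"], reverse=True):
        if not c["same_footprint"]:
            continue                     # swap restricted to same-footprint cells
        if slack[c["path"]] >= c["penalty_ps"]:
            slack[c["path"]] -= c["penalty_ps"]
            swapped.append(c["name"])
    return swapped, slack

cells = [
    {"name": "u1", "path": "p0", "penalty_ps": 30, "leak_saving_nw": 12, "same_footprint": True},
    {"name": "u2", "path": "p0", "penalty_ps": 30, "leak_saving_nw": 9, "same_footprint": True},
    {"name": "u3", "path": "p1", "penalty_ps": 20, "leak_saving_nw": 15, "same_footprint": False},
    {"name": "u4", "path": "p1", "penalty_ps": 20, "leak_saving_nw": 5, "same_footprint": True},
]
swapped, remaining = swap_to_high_vt(cells, {"p0": 50, "p1": 100})
```

Here `u2` is left at normal $V_T$ because `u1` has already consumed most of the slack on their shared path, and `u3` is skipped for its footprint.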

The effective current source model (ECSM) cell delay library was generated at multiple VDD levels at the outset. The model's capability to represent delay at various VDD voltages on a per-cell-instance basis, whether due to power management or IR-drop effects, contributed directly to the results achieved [7, 8]. Results generated with this numerical model deviate by ~2% from full circuit simulation (Fig. 10b).

IR drop has been optimized with design technology which explicitly models the current flow to compute voltage drops, based on automatically-extracted, layout-accurate VDD and VSS resistor mesh networks (Figs. 9a, 9b).





Fig. 9a: 0.8V VDD IR Drop Analysis

Fig. 9b: 1.0V VDD IR Drop Analysis
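The principle behind such an IR-drop computation can be shown on a deliberately simplified case: a single power rail fed from one end, with a cell drawing current at each node. The real analysis solves extracted two-dimensional VDD and VSS meshes; the segment resistances and tap currents here are hypothetical.

```python
def rail_voltages(vdd, seg_res_ohm, tap_current_a):
    """Node voltages along a rail fed from one end (node k follows segment k)."""
    voltages = []
    v = vdd
    for k, r in enumerate(seg_res_ohm):
        downstream = sum(tap_current_a[k:])   # current flowing through segment k
        v -= r * downstream                   # Ohm's law drop across the segment
        voltages.append(v)
    return voltages

# Three 0.5-ohm segments, each node drawing 10 mA, on the 0.8V supply.
node_v = rail_voltages(0.8, [0.5, 0.5, 0.5], [0.010, 0.010, 0.010])
worst_drop_v = 0.8 - min(node_v)
```

The farthest node sees the largest drop (30 mV in this example), because every upstream segment carries the current of all nodes behind it.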

Signal noise immunity varies with VDD level. Signals in the 1.0V domain can tolerate higher noise than signals in the 0.8V domain. At the time of this development, design technology to analyze and optimize signal integrity by taking VDD levels into account was not available. A conservative single noise threshold was therefore set, and all signal nets were verified against it, to ensure acceptable signal integrity. The place-and-route technology takes signal integrity into account during routing. Consequently, only ~10 nets, from a total of ~500K nets, required post-layout SI optimization.

#### IV. SYSTEM-LEVEL DESIGN VALIDATION

The IC achieved functional and electrical design requirements with the first silicon.





Fig. 10a: IC Photomicrographs

Fig. 10b: ECSM correlation to circuit simulation (%)

Packaged parts have been validated with ATE measurements. Additionally, ~15,000 system-level validation tests have been completed successfully to date, using the ARM RealView® validation system board. The Linux operating system is currently in operation on the IC (with gaming applications being run on the IC; Fig. 10a).

#### V. SUMMARY OF DESIGN RESULTS

This methodology enabled single-pass power/leakage optimization: 97.5% and 8.5% of the standard cells in the 1.0V and 0.8V domains, respectively, are of the high-$V_T$ type. Leakage power has been reduced by 46% in this design containing ~300K placeable objects and additional SRAM blocks. Overall power dissipation has been reduced by ~40% with respect to the conventional design while maintaining a 355 MHz clock rate under typical conditions (Tables 1, 2, 3). Methodology enhancements enabled a design schedule of 4 months from initial netlist to tape-out.

Table 1: Summarized Data on Reduction in Power Consumption

| Power Savings           | 1.0V VDD Domain | 0.8V VDD Domain | Total |
|-------------------------|-----------------|-----------------|-------|
| Dynamic                 | 12%             | 50.3%           | 37.9% |
| Leakage                 | 69.3%           | 33.6%           | 46.7% |
| **Total Power Savings** |                 |                 | 40.3% |

Table 2: Summarized Data on Power Consumption (Dynamic Power Dissipation, mW/MHz)

| IC Functional Block | Simulated Baseline (90nm) | Simulated Low Power (90nm) | Measured Low Power (90nm) | Measured Power (130nm; ARM published) |
|---------------------|---------------------------|----------------------------|---------------------------|---------------------------------------|
| Core                | 0.28                      | 0.14                       | 0.1                       | 0.6                                   |
| Other               | 0.36                      | 0.32                       | 0.21                      |                                       |
| Total               | 0.64                      | 0.46                       | 0.31                      |                                       |
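As a worked example, the per-MHz figures in Table 2 can be converted to absolute dynamic power at the 355 MHz operating point. This assumes that the mW/MHz values scale linearly with clock frequency up to 355 MHz; the variable and function names are illustrative.

```python
CLOCK_MHZ = 355  # typical-conditions clock rate reported for this IC

# Dynamic power per MHz (mW/MHz), copied from Table 2.
table2_mw_per_mhz = {
    "simulated_baseline": {"core": 0.28, "other": 0.36},
    "simulated_low_power": {"core": 0.14, "other": 0.32},
    "measured_low_power": {"core": 0.10, "other": 0.21},
}

def dynamic_power_mw(column, clock_mhz=CLOCK_MHZ):
    """Total dynamic power in mW, assuming linear scaling with frequency."""
    return sum(table2_mw_per_mhz[column].values()) * clock_mhz

power_mw = {col: round(dynamic_power_mw(col), 2) for col in table2_mw_per_mhz}
```

Under this assumption the totals come to roughly 227 mW (simulated baseline), 163 mW (simulated low power) and 110 mW (measured low power).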

Table 3: Summarized IC Data

| Parameter        | Data                      |
|------------------|---------------------------|
| Clock Frequency  | 355 MHz (Typ. Conditions) |
| Technology       | TSMC 90G                  |
| Transistor Count | 22M                       |
| Core Voltage     | 1.0V, 0.8V                |
| I/O Voltage      | 3.3V                      |
| Pin Count        | 362                       |

#### VI. ACKNOWLEDGMENTS

We thank C. Chu, A. Gupta, L. Jensen, T. Valind, L. Milano, A. Iyer, P. Mamtora, J. Willis and M. McAweeney for their contributions.

#### VII. REFERENCES

- [1] Gartner, WW ASIC/ASSP, FPGA/PLD and SLI/SOC App. Fcst., 1Q04
- [2] B. Calhoun, "Ultra-Dynamic Voltage Scaling Using Sub-threshold Operation and Local Voltage Dithering in 90nm CMOS," ISSCC, Feb. 2005
- [3] S. Henzler, "Sleep Transistor Circuits for Fine-Grained Power Switch-Off with Short Power-Down Times," ISSCC, Feb. 2005
- [4] http://www.arm.com/pdfs/DUI0273B\_core\_tile\_user\_guide.pdf.
- [5] A. Khan et al., "Design and Development of 130-nanometer ICs for a Multi-Gigabit Switching Network System," CICC, Oct. 2004
- [6] D. Desharnais, "Nanometer IC routing requires new approaches," EEDesign.com, Dec. 2003
- [7] A. Khan et al., "A 150 MHz Graphics Rendering Processor with 256Mb Embedded DRAM," ISSCC, Feb. 2001
- [8] G. Paul et al., "A Scalable 160Gb/s Switch Fabric Processor with 320Gb/s Memory Bandwidth," ISSCC, Feb. 2004
