# Reconsidering High-Speed Design Criteria for Transmission-Gate-Based Master–Slave Flip-Flops

Elio Consoli, Gaetano Palumbo, Fellow, IEEE, and Melita Pennisi, Member, IEEE

Abstract—In this paper we show that, when dealing with transmission-gate-based master-slave (TGMS) flip-flops (FFs), a reconsideration of the classical approach to delay minimization is worthwhile to improve performance in high-speed designs. In particular, by splitting such FFs into two sections that are separately optimized and then reconciling the results, the emerging design always outperforms the one resulting from a classical Logical Effort procedure that treats such FFs as a single continuous path. Simulations are performed on several well-known TGMS FFs, designed in a 65-nm technology, to validate the correctness of such a procedure and of the underlying assumptions. Significant improvements are found in delay and, remarkably, in energy and area occupation, thus showing that this approach correctly handles the actual path effort in such circuits and hence more properly steers the design towards energy efficiency in the high-speed region.

*Index Terms*—Circuit optimization, flip-flops (FFs), high-speed, logical effort, master-slave, transmission-gate.

## I. INTRODUCTION

FLIP-FLOPS (FFs) are the basic building blocks of datapath structures. Indeed, they allow for the storage of data processed by combinational circuits and the synchronization of operations at a given clock frequency [1]. Because of their multistage structure, high clock switching activity, and increasing portion of clock period occupied by their timing latency, the speed and energy of FFs significantly affect the overall performance of a datapath [2], [3].

Optimal FF design strategies are usually based on automated algorithms embedded directly into simulators [1], [3], [4]. These algorithms are powerful methods to optimize constraints such as speed, energy consumption, or energy-delay products, even for complicated FFs consisting of several internal nodes. Moreover, they can also account for the joint optimization of FFs and clock networks, for instance, through a proper clock slope setting [5]. Of course, the resulting design strategies will depend on the specific FF topology and on the design constraint to be optimized.

When FFs are placed into critical paths they need to exhibit a small data-to-output delay [3], and, as is well known, when circuit speed is the primary concern, the Logical Effort (LE) approach can be employed [6]. Such a method is very useful for designers since it allows them to develop simple back-of-the-envelope calculations to account for speed performance even in the early design phase. Moreover, LE design also constitutes a basis for the optimization of figures of merit that weigh delay more heavily than energy, such as energy-delay products $E^iD^j$ with j considerably larger than i [4].

Manuscript received June 17, 2010; revised September 16, 2010; accepted December 01, 2010. Date of publication January 13, 2011; date of current version January 18, 2012.

The authors are with the Dipartimento di Ingegneria Elettrica, Elettronica e dei Sistemi (DIEES), University of Catania, I-95125 Catania, Italy (e-mail: econsoli@diees.unict.it; gpalumbo@diees.unict.it; mpennisi@diees.unict.it).

Digital Object Identifier 10.1109/TVLSI.2010.2098426

Transmission-gate (or pass-transistor)-based master–slave (TGMS) FFs are among the most popular and simplest FF topologies, and many of them have been proposed in the past [7]–[13]. Their features include a small area occupation, few internal nodes to be charged and discharged, and the absence of precharge. All of these factors lead to low dissipation, and hence TGMS FFs can be effectively employed in energy-efficient microprocessors [14]–[17].

The focus of this paper is to provide helpful guidelines for the design of TGMS FFs in a high-speed environment by means of a reconsideration of the classical LE approach.

Traditionally, LE optimization is carried out by looking at the whole circuit as a single uninterrupted path [1], [6]. We show instead that, for this specific class of circuits, the problem of delay minimization has to be looked at from a different perspective by resorting to a novel approach, which draws inspiration from preliminary considerations in [18]. The LE basis is still exploited but, unlike the traditional methodology, TGMS FFs are split into two overlapping sections and two different paths that are separately optimized. In particular, the paths considered are the first part of the one considered in the traditional methodology and the clock-to-output one. As will be shown, breaking the data-to-output path instead of considering it as a whole leads to the actual delay minimization. Remarkably, the energy consumption and area occupation of the resulting designs are also always significantly lower than those obtained with the traditional LE method.

This means that the actual path effort of TGMS FFs is more properly handled through such a new approach, whereas the traditional one fails to capture it correctly. These considerations can be practically exploited when sizing these circuits in the high-speed energy-efficient design region, i.e., as a basis (or as a starting point) when also accounting for energy in the minimization of energy-delay products $E^iD^j$, where j is significantly larger than i.

The remainder of this paper is organized as follows. In Section II, the main definitions about FF timing are clarified and applied to the case of TGMS FFs. In Section III, the novel approach is discussed by revisiting the method of LE. In Section IV, three TGMS FFs are designed to exemplify the proposed approach. Comparisons with the traditional design strategy and with the results of a simulation-driven optimization are carried out in Section V. The analyses are performed in a 65-nm technology by exploring various loading and input capacitance conditions. Finally, conclusions are drawn in Section VI. An Appendix is added to describe the simulation setup.

Fig. 1. Pipeline structure. (a) Clock signal. (b) FF timing.

Fig. 2. Typical $(\tau_{\rm D-Q}, \tau_{\rm CK-Q})$ versus $\tau_{\rm CK-D}$ curves $(\tau_{\rm CK-D} = -\tau_{\rm D-CK})$.

## II. TIMING BEHAVIOR OF GENERIC AND TGMS FFS

Without loss of generality, let us refer to a generic stage of a datapath structure made up by a negative edge-triggered FF inserted between two combinational blocks, as shown in Fig. 1(a). The signal CK, clocking the FF, is reported in Fig. 1(b), which also highlights the falling edge of the clock, during which data is transferred from node D to node Q.

The timing behavior of the FF is characterized by two main parameters: the data-to-clock $\tau_{\rm D-CK}$ and the clock-to-output $\tau_{\rm CK-Q}$ delays [1]. The overall timing overhead introduced by an FF and affecting the clock period duration is the sum of the above contributions, i.e., the data-to-output $\tau_{\rm D-Q}$ delay [3]. To reduce the influence of FF timing on pipeline speed performance, the parameter $\tau_{\rm D-Q}$ has to be minimized. Hence, it represents the actual figure of merit for FF speed [14]–[17].

Fig. 3. Structure of a generic TG (or PT)-based FF.

When decreasing $\tau_{\rm D-CK}$, the effects on $\tau_{\rm CK-Q}$ and $\tau_{\rm D-Q}$ delays are opposite: as shown in Fig. 2, the former increases, whereas the latter initially decreases [2]. For this reason, the setup time $t_{\rm setup}$ is defined as the optimum $\tau_{\rm D-CK}$ leading to the minimum $\tau_{\rm D-Q}$ [1] (see Fig. 2)<sup>1</sup>, and the minimum propagation delay through the FF is

$$\tau_{\text{D-Q,min}} = t_{\text{setup}} + \tau_{\text{CK-Q,opt}} \tag{1}$$

where  $\tau_{\rm CK-Q,opt}$  is the value of  $\tau_{\rm CK-Q}$  for  $\tau_{\rm D-CK}=t_{\rm setup}$ . When employing FFs in critical paths, it can be assumed that actually  $\tau_{\rm D-CK}=t_{\rm setup}$ , i.e., the pipeline is optimized to work in the minimum propagation delay condition [15].
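As a purely illustrative sketch of definition (1), the following Python snippet picks the setup time from a sampled $\tau_{\rm CK-Q}$ versus $\tau_{\rm D-CK}$ curve by minimizing their sum. The helper name `min_d_to_q` and the sample values (in ps) are hypothetical and are not taken from the simulations reported in this paper.

```python
def min_d_to_q(d_ck, ck_q):
    """Return (t_setup, ck_q_opt, d_q_min) from sampled delay curves.

    d_ck[i] is a data-to-clock delay and ck_q[i] the clock-to-output
    delay measured for it; their sum is the data-to-output delay.
    """
    d_q = [a + b for a, b in zip(d_ck, ck_q)]
    i = min(range(len(d_q)), key=d_q.__getitem__)  # index of minimum D-Q delay
    return d_ck[i], ck_q[i], d_q[i]


# Hypothetical samples (ps): CK-Q grows as D arrives closer to the clock edge.
d_ck_samples = [10, 20, 30, 40, 50, 60]
ck_q_samples = [120, 80, 62, 55, 52, 51]

t_setup, ck_q_opt, d_q_min = min_d_to_q(d_ck_samples, ck_q_samples)
```

With these made-up samples the minimum of the sum falls at an interior point of the sweep, which mirrors the shape of the curves in Fig. 2.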

FFs can be basically split into two topological categories: Pulsed FFs and MS FFs [16]. The former feature an internally or externally generated time window during which the FF is transparent to the input data. Such a time window implies: 1) a flat minimum region in the  $\tau_{\rm D-Q}$  versus  $\tau_{\rm D-CK}$  curve; 2) a negative  $t_{\rm setup}$ ; and 3) a continuous topological path from D to Q since D is the actual critical input when considering  $\tau_{\rm D-Q,min}$  as the figure of merit [1].

On the contrary, MS FFs are constituted by two latches that are alternately transparent according to the  $\phi$  value. This implies: 1) a high  $\tau_{\rm D-Q}$  to  $\tau_{\rm D-CK}$  sensitivity in the minimum region (as in the case of Fig. 2); 2) a positive  $t_{\rm setup}$ ; and 3) the presence of two distinct paths from the input node D to the boundary node between master and slave sections, and from this node to the output Q [1].

To exemplify the above discussions, let us consider the generic structure of a transmission-gate (TG) [or pass-transistor (PT)]-based MS (TGMS) FF shown in Fig. 3 (the depicted inverters can stand for generic combinational blocks, while keepers and/or feedback paths are not shown). The node X is the boundary between the master and slave sections and the paths relative to  $\tau_{\rm D-CK}, \tau_{\rm CK-Q}$ , and  $\tau_{\rm D-Q}$  delays are depicted with gray lines.

When  $\tau_{\rm D-CK}$  is sufficiently large, the input signal traverses the master latch and stops at node X, waiting for the slave TG to be enabled by the falling clock transition. After that, the input is transferred to the output.

On the contrary, when $\tau_{\mathrm{D-CK}} = t_{\mathrm{setup}}$, the last gate in the master section (henceforth referred to as block A, as shown in Fig. 3) transfers its input nearly simultaneously with the enabling of the TG (henceforth referred to as block B, as shown in Fig. 3) in the slave section. However, as will be shown in the following, the traditional assumption of an uninterrupted path from D to Q [1], [6] is not consistent.

<sup>1</sup>Fig. 2 also shows the definition of the hold time, i.e., the clock-to-data delay that leads to a certain percentage increment (e.g., 5%) of $\tau_{\rm CK-Q}$ with respect to its minimum value $\tau_{\rm CK-Q,min}$. In any case, hold-time-related constraints are not a concern when dealing with MS topologies [1] such as those discussed in this paper, and hence no further considerations are added.

An indication of such an incongruence arises since, assuming the union of blocks A and B as a single stage (they are performing logical operations at the same time), it is not clear if the critical input signal to be considered is the input of block A or the CK signal enabling block B.

Therefore, in order to optimize the speed of TGMS FFs in terms of  $\tau_{\rm D-Q,min}$ , rather than applying the LE method to the whole D-Q path, a different approach may be required. In particular, we will show that  $t_{\rm setup}$  and  $\tau_{\rm CK-Q,opt}$  have to be separately handled.

## III. HIGH-SPEED DESIGN STRATEGY FOR TGMS FFS

## A. LE Basics and TG Related Considerations

The LE method minimizes the delay of a multistage path by equalizing the efforts of the various stages [6], i.e., by setting

$$g_i h_i = \sqrt[N]{GBH} \quad \forall i = 1 \dots N \tag{2}$$

where

- $N$ is the number of stages making up the path;
- $g_i$ and $h_i$ are the logical and electrical efforts of the *i*th stage, respectively [6];
- $G = \prod g_i$ is the logical effort of the entire path [6];
- $\prod h_i = BH$, where $B$ is the branching effort of the entire path and $H = C_L/C_{\rm IN}$ is its electrical effort, with $C_L$ and $C_{\rm IN}$ being the final output load and the input capacitance of the first stage, respectively [6].
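To make condition (2) concrete, the following minimal Python sketch sizes a branch-free path (B = 1) by equalizing the stage efforts. The function name `size_chain` and the example values are illustrative assumptions; the sketch ignores keepers and branching, which the FFs analyzed later do not.

```python
def size_chain(g, c_in, c_load):
    """Equal-stage-effort sizing of a branch-free path, following (2).

    g      : list of logical efforts g_i, one per stage
    c_in   : input capacitance of the first stage
    c_load : final output load
    Returns the common stage effort and the input capacitance of each stage.
    """
    n = len(g)
    G = 1.0
    for gi in g:
        G *= gi
    H = c_load / c_in
    f = (G * H) ** (1.0 / n)  # stage effort from (2), with B = 1

    # Walk backwards from the load: C_in,i = g_i * C_out,i / f.
    caps = [0.0] * n
    c_out = c_load
    for i in range(n - 1, -1, -1):
        caps[i] = g[i] * c_out / f
        c_out = caps[i]
    return f, caps


# Three inverters (g = 1) driving a load 64 times the input capacitance.
f, caps = size_chain([1.0, 1.0, 1.0], c_in=1.0, c_load=64.0)
```

For this example the stage effort comes out close to 4 and each stage is about four times larger than the previous one, up to floating-point rounding.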

The LE approach was introduced to handle CMOS gates with driving capability (static or dynamic), but it can be extended to the case where TGs (or PTs) are present. In particular, in [6], it is stated that logical effort, electrical effort, and parasitic delay can be defined provided that TGs are incorporated into the preceding stages with driving capability.

Nevertheless, these parameters can be extracted even more accurately by applying the Elmore delay model, which is easily adaptable in an LE fashion [19], [20]. For this reason, in the remainder of this paper, the LE method is applied on the basis of the more accurate Elmore delay interpretation. It is worth highlighting that such a modified LE basis is not the focus of the suggested approach. Indeed, it is used for both the traditional and the suggested high-speed methodologies, and simply allows us to carry out a more effective sizing of the analyzed circuits in both cases (see Section IV). Nevertheless, all the reported results (see Sections V and VI) are actually extracted by means of simulations.

Finally, it is worth noting that, as suggested in [6], the most effective approach is to equally size the PMOS and NMOS transistors composing a TG.<sup>2</sup>

<sup>2</sup>It can be assumed that equally sized PMOS/NMOS PTs exhibit a resistance equal to 4R/R when transferring a logic "0" and 2R/2R when transferring a logic "1" [6] (assuming NMOS mobility twice that of PMOS). Therefore, a TG with equally sized PMOS and NMOS transistors exhibits a resistance nearly equal to R for both "1" and "0" inputs. There is no point in increasing the size of the PMOS (as usually done in static/dynamic gates with driving capability) since, by sizing the PMOS twice the NMOS, the TG resistance is equal to (2/3)R for both "1" and "0" at the input but the capacitances at the input and output of the TG increase by 50%.
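The arithmetic in footnote 2 can be reproduced with a short Python sketch that combines the PMOS and NMOS pass resistances in parallel. The function names are illustrative; the per-device resistances (4R/R for a logic "0" and 2R/2R for a logic "1" at unit width) are the ones assumed in the footnote.

```python
from fractions import Fraction


def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def tg_resistance(w_p, w_n):
    """TG resistance in units of R when passing a logic 0 and a logic 1.

    Unit-width devices are assumed to follow footnote 2:
    PMOS 4R (logic 0) and 2R (logic 1), NMOS R (logic 0) and 2R (logic 1).
    Resistance scales inversely with the normalized width.
    """
    r0 = parallel(Fraction(4) / w_p, Fraction(1) / w_n)
    r1 = parallel(Fraction(2) / w_p, Fraction(2) / w_n)
    return r0, r1


equal_sized = tg_resistance(1, 1)   # PMOS and NMOS with the same width
pmos_double = tg_resistance(2, 1)   # PMOS twice the NMOS
cap_increase = Fraction(2 + 1, 1 + 1) - 1  # total width 3 versus 2
```

Under these assumptions the equally sized TG gives 4R/5 and R, i.e., nearly R in both cases, while doubling the PMOS gives (2/3)R in both cases at the cost of a 50% larger total width.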



Fig. 4. Gates at the boundary between master and slave latches.

## B. Sizing at the Boundary Between Master and Slave

The general rule when considering a transparent TG driven by a static gate is to size its transistors equally to those of the preceding gate. For instance, in the usual case where the previous gate is a simple inverter (INV), the input of such an inverter will be the critical signal (the TG is transparent). The highest speed and symmetrical rising/falling behavior of the whole block INV+TG is achieved by sizing the PMOS of the INV with a width twice the NMOS (assuming NMOS mobility twice that of PMOS) and the transistors in the TG with the same width of the NMOS of the INV (see footnote 2).

When sizing transistors at the boundary between master and slave, as in the case of blocks A and B (see Fig. 3), the purpose is still to achieve symmetrical and minimum rising and falling delays. However, it is less evident how to set the relative size between the two blocks, since this time the TG can be enabled slightly before, simultaneously with, or slightly later than the time when the combinational block (considered as a simple INV for simplicity) begins to transfer its input.

To resolve the doubt, we considered only the two blocks A and B, as shown in Fig. 4, and fed them directly with the D input, loading block B with a capacitance $C_o$. The minimum delay<sup>3</sup> from D to output O was analyzed for various sizes of the INV and various values of $C_o$, by varying the size $W_{\rm TG}$ of the TG (smaller than, equal to, or larger than the NMOS width in the INV, $W_{\rm INV}$). Again, we found that a symmetrical and minimum delay is obtained by sizing the PMOS $(2W_{\rm INV})$ twice the NMOS $(W_{\rm INV})$ and $W_{\rm TG} = W_{\rm INV}$.

## C. Traditional Sizing Strategy to Maximize TGMS FF Speed

Once we found that the blocks at the boundary between the master and slave are identified by a single width, we need to set their absolute size, together with the size of the remaining stages in the TGMS FF, in order to minimize  $\tau_{\rm D-Q,min}$ .

The traditional approach would be to consider a unique path from D to Q and apply the LE method with a number of stages given by all of the gates in the master and slave sections.

The normalized width (with respect to the minimum value $W_{\min}$) of the first stage, $w_1$, as well as the equivalent normalized width of the load $w_L$, are assumed as known parameters. Keepers and feedback paths typically have a fixed size and, hence, lead to nonconstant branching effects, i.e., to nonlinearities [4]. Therefore, an iterative procedure is required to satisfy the LE optimum condition of an equal stage effort among all FF stages.

<sup>3</sup>In the case of Fig. 4, the delay from D to O diminishes by decreasing the interarrival time between D and CK up to reaching a constant minimum. Conversely, in a whole TGMS FF, the delay increases after having reached the minimum, since a too small D and CK interarrival time means that the input is not well captured (or not captured at all) before the TG in the master is disabled.

## D. Novel Sizing Strategy to Maximize TGMS FF Speed

The novel approach suggested in the paper breaks up the optimization in two steps. In particular, two LE optimizations are carried out to minimize the delays from input D to node X (path 1) and from CK (enabling the Slave TG) to output Q (path 2), and then the results are reconciled.

In the authors' opinion, such an approach is intuitively justifiable since the signals coming from D and CK, which traverse block B, would experience a different effort according to the classic interpretation. Hence, although blocks A and B act nearly simultaneously when the condition $\tau_{D-CK} = t_{\text{setup}}$ is satisfied, two distinct overlapping paths can be identified. Such paths are not simply restricted to master and slave sections. Indeed, the first delay (up to node X) is influenced by the enabled block B in the slave (and hence by the input capacitance of the gate that follows block B) and the second delay is influenced by the resistance introduced by block A.

According to the above point of view, the overall path effort is hence more appropriately broken into two separate contributions and, rather than according to

$$\operatorname{opt}(t_{\text{setup}} + \tau_{\text{CK-Q,opt}}) \tag{3}$$

the minimum  $\tau_{D-Q,\min}$  is actually found according to

$$\operatorname{opt}(t_{\text{setup}}) + \operatorname{opt}(\tau_{\text{CK-Q,opt}}) \tag{4}$$

where the notation  $\operatorname{opt}(T)$  means that the delay T is optimized by applying the LE method.

According to (4), two sets of LE parameters have to be derived for paths 1 and 2, and condition (2) is applied to both paths. Note that the input capacitance of the gate following block B is considered as the final load for path 1, while blocks A and B represent the first stage for path 2.

Further arrangements are necessary to properly define the LE equations according to the FF topology (the examples in the next section clarify many practical aspects). Nevertheless, we anticipate that, by separately optimizing path 1 and path 2 and then reconciling the results, a unique possible size for blocks A and B (and hence for all of the other gates) comes out, just like in the traditional LE approach.

Moreover, another aspect strengthening the above point of view concerns the role played by variability when TGMS FFs are employed in the critical paths of pipelined schemes.

Indeed, due to their high  $\tau_{\rm D-Q}$  to  $\tau_{\rm D-CK}$  sensitivity in the minimum region (see Section II), TGMS FFs have to be actually operated with a  $\tau_{\rm D-CK}$  slightly larger than  $t_{\rm setup}$  (which led to the very  $\tau_{\rm D-Q,min}$  delay) in order to guarantee a sufficient margin to absorb the impact of process-environmental variations and external clock skew-jitter. Yet, when employed in critical paths, TGMS FFs still obviously work under the condition where there is a certain overlap between the operations of the blocks A and B described in Section II, i.e., under the condition where the figure of merit for speed is still the  $\tau_{\rm D-Q}$  delay and not only the  $\tau_{\rm CK-Q}$  one (as in fast paths). Hence, an LE-based optimization targeting the D-Q path is still consistent to minimize the impact of TGMS FFs timing on the clock period.



Fig. 5. (a) Schematic of the TGMS FF proposed in [11]. LE parameters according to the (b) traditional and (c) proposed approaches.

Given all of the above, precisely because of the margin that has to be provided on $\tau_{\rm D-CK}$, the assumption of splitting the D-Q path into two sections becomes even more consistent and justifiable than the traditional one.<sup>4</sup>

## IV. DESIGN EXAMPLES

## A. Modified Version of the PowerPC 603 FF

Here, we consider the typical TGMS FF shown in Fig. 5 and introduced in [11]. It is a modified version of the well-known FF employed in the PowerPC 603 low-power processor [8]. In particular, an inverter is added to isolate the D input and provide better noise immunity. The input is transferred to the output with inverted polarity,  $\bar{Q}$ , and simple gated keepers are employed.

<sup>4</sup>When increasing $\tau_{\rm D-CK}$ with respect to $t_{\rm setup}$, we are getting closer to (but not really reaching) the condition in which block A fully completes its operation before block B is enabled. This reinforces the intuition according to which the paths up to and after node X have to be separately handled.

TABLE I LE PARAMETERS FOR THE TGMS FF IN [11] CONSIDERED AS A WHOLE PATH (FROM D TO $\bar{Q}$) WITH N=3 STAGES

| Stage | Normalized (in LE fashion) Elmore delay d | Logical effort g | Electrical effort h | Parasitic delay p |
|-------|-------------------------------------------|------------------|---------------------|-------------------|
| 1 | $\frac{(5w_1)\frac{1}{w_1} + (2w_1 + 2 + 3w_2)\frac{2}{w_1}}{3}$ | 2 | $\frac{2+3w_2}{3w_1}$ | 3 |
| 2 | $\left[\frac{(5w_2 + 2)\frac{1}{w_2} + (2w_2 + 2 + 3w_3)\frac{2}{w_2}}{3} + \frac{(2w_2 + 2 + 3w_3)\frac{2}{w_2}}{3}\right]/2$ | $\left[2+\frac{2}{3}\right]/2$ | $\left[\frac{3+3w_3}{3w_2} + \frac{2+3w_3}{w_2}\right]/2$ | $\left[3+\frac{4}{3}\right]/2$ |
| 3 | $\frac{(3w_3 + 2 + w_L)\frac{1}{w_3}}{3}$ | 1 | $\frac{2+w_L}{3w_3}$ | 1 |

TABLE II LE PARAMETERS FOR THE TGMS FF IN [11] CONSIDERED AS THE UNION OF TWO PATHS EACH WITH N=2 STAGES

| Path-Stage | Normalized (in LE fashion) Elmore delay d | Logical effort g | Electrical effort h | Parasitic delay p |
|------------|-------------------------------------------|------------------|---------------------|-------------------|
| 1-1 | $\frac{(5w_1)\frac{1}{w_1} + (2w_1 + 2 + 3w_2)\frac{2}{w_1}}{3}$ | 2 | $\frac{2+3w_2}{3w_1}$ | 3 |
| 1-2 | $\frac{(5w_2+2)\frac{1}{w_2}+(2w_2+2+3w_3)\frac{1}{w_2}}{3}$ | 1 | $\frac{4+3w_3}{3w_2}$ | $\frac{7}{3}$ |
| 2-1 | $\frac{(2w_2+2+3w_3)\frac{2}{w_2}}{3}$ | $\frac{2}{3}$ | $\frac{2+3w_3}{w_2}$ | $\frac{4}{3}$ |
| 2-2 | $\frac{(3w_3 + 2 + w_L)\frac{1}{w_3}}{3}$ | 1 | $\frac{2+w_L}{3w_3}$ | 1 |



Fig. 6. Schematic of the WPMS FF [12], [13].

The normalized widths (with respect to the minimum value $W_{\rm min}$) of the various stages are highlighted in Fig. 5 (the keepers are minimum sized). In particular, the first INV + TG block (M1-M4) in the master has the width $w_1$ given by the FF input capacitance specifications. Blocks A and B correspond to M5-M8 and are identified by a width $w_2$, while INV M9-M10 is identified by a width $w_3$.<sup>5</sup>

If we consider the traditional LE approach in Section III-C, the LE parameters relative to the various stages are those in Table I ( $w_L$  is the equivalent load width). The stages corresponding to the LE parameters in Table I are shown in Fig. 5(b) for exemplification.

In this case, the LE method has to be applied by assuming a number of stages N=3 (nonlinear equations arise due to branching and hence the solution has to be found iteratively).
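The iterative solution mentioned above can be sketched in Python as follows. This is only one possible way to do it, not necessarily the procedure used by the authors: it takes the stage efforts as the products g·h of the entries of Table I (for stage 2, the product of the averaged g and the averaged h, which is an assumption about how the table is meant to be used), enforces equal effort on stages 1 and 3 analytically, and finds $w_2$ by bisection. The function names and the example values of $w_1$ and $w_L$ are hypothetical.

```python
def stage_efforts(w1, w2, w3, wL):
    """Stage efforts f_i = g_i * h_i built from the entries of Table I."""
    f1 = 2.0 * (2 + 3 * w2) / (3 * w1)
    g2 = (2 + 2 / 3) / 2                                   # averaged logical effort
    h2 = ((3 + 3 * w3) / (3 * w2) + (2 + 3 * w3) / w2) / 2  # averaged electrical effort
    f2 = g2 * h2
    f3 = (2 + wL) / (3 * w3)
    return f1, f2, f3


def size_traditional(w1, wL, lo=1e-6, hi=1e6, iters=200):
    """Find (w2, w3) such that f1 = f2 = f3 for given w1 and wL."""

    def residual(w2):
        f1 = 2.0 * (2 + 3 * w2) / (3 * w1)
        w3 = (2 + wL) / (3 * f1)        # makes f3 equal to f1
        _, f2, _ = stage_efforts(w1, w2, w3, wL)
        return f2 - f1, w3              # decreasing function of w2

    for _ in range(iters):
        mid = (lo * hi) ** 0.5          # bisection on a logarithmic scale
        r, _ = residual(mid)
        if r > 0:
            lo = mid
        else:
            hi = mid
    w2 = (lo * hi) ** 0.5
    _, w3 = residual(w2)
    return w2, w3


w2_trad, w3_trad = size_traditional(w1=2.0, wL=20.0)
```

The bisection relies on the residual decreasing monotonically with $w_2$, which holds for the expressions above because the stage-1 effort grows with $w_2$ while the stage-2 effort shrinks.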

As anticipated, the Elmore delay model is applied to determine the expressions of delays of blocks M1-M4 and M5-M8 (from which LE parameters are then extracted). Note that, in Table I, capacitive terms are between parentheses and are multiplied by the resistances from each node to $V_{\rm DD}$/GND. The diffusion capacitance introduced by each transistor is set equal to its gate capacitance for the same width [6] (it has been verified that they are nearly equal).

<sup>5</sup>PMOS transistors M2, M6, and M10 actually have widths $2w_1$, $2w_2$, and $2w_3$, respectively.

Moreover, the resistance reduction exhibited by stacked transistors due to velocity saturation is neglected, since, in the adopted 65-nm technology, it is nearly compensated by strong channel length modulation and DIBL effects.

Regarding the parameters of the second stage, they are derived by averaging out two different cases, given here.

- 1) The input of INV M5-M6 is considered as the critical signal.
- 2) CK enabling TG M7–M8 is considered as the critical signal.

We verified that such an assumption leads to the best results for the traditional N=3 LE procedure with respect to the case of simply assuming the input of INV M5-M6 as the critical signal, and hence it is the fairest choice if one wants to point out any possible merit of the suggested approach.

Considering the proposed approach in Section III-D, two sets of LE parameters are derived for the N=2 paths 1 and 2 (identified by the first subscript) and are reported in Table II. The stages corresponding to the LE parameters in Table II are shown in Fig. 5(c) for exemplification. Note that the first and last rows of Tables I and II are equal.

TABLE III LE PARAMETERS FOR THE WPMS FF CONSIDERED AS THE UNION OF TWO PATHS EACH WITH N=2 STAGES

| Path-Stage | Normalized (in LE fashion) Elmore delay d | Logical effort g | Electrical effort h | Parasitic delay p |
|------------|-------------------------------------------|------------------|---------------------|-------------------|
| 1-1 | $\frac{(4w_1+1)\frac{1}{w_1}+(w_1+4+3w_2)\frac{5}{2w_1}}{3}$ | $\frac{5}{2}$ | $\frac{\frac{22}{5} + 3w_2}{3w_1}$ | $\frac{13}{6}$ |
| 1-2 | $\frac{(4w_2+1)\frac{1}{w_2}+(w_2+4+3w_3)\frac{1}{w_2}}{3}$ | 1 | $\frac{5+3w_3}{3w_2}$ | $\frac{5}{3}$ |
| 2-1 | $\frac{(w_2+4+3w_3)\frac{5}{2w_2}}{3}$ | $\frac{5}{6}$ | $\frac{4+3w_3}{w_2}$ | $\frac{5}{6}$ |
| 2-2 | $\frac{(3w_3 + w_L)\frac{1}{w_3}}{3}$ | 1 | $\frac{w_L}{3w_3}$ | 1 |

TABLE IV LE PARAMETERS FOR THE 2ND (3RD) STAGE OF THE WPMS FF ($\mathrm{C^2MOS}$ FF) CONSIDERED AS A WHOLE PATH WITH N=3 (N=4) STAGES

| Stage | Normalized (in LE fashion) Elmore delay d | Logical effort g | Electrical effort h | Parasitic delay p |
|-------|-------------------------------------------|------------------|---------------------|-------------------|
| 2 (WPMS) | $\left[\frac{(4w_2+1)\frac{1}{w_2} + (w_2+4+3w_3)\frac{5}{2w_2}}{3} + \frac{(w_2+4+3w_3)\frac{5}{2w_2}}{3}\right]/2$ | $\left[\frac{5}{2} + \frac{5}{6}\right]/2$ | $\left[\frac{\frac{22}{5} + 3w_3}{3w_2} + \frac{4 + 3w_3}{w_2}\right]/2$ | $\left[\frac{13}{6} + \frac{5}{6}\right]/2$ |
| 3 ($\mathrm{C^2MOS}$) | $\left[\frac{([2w_3 + w_3]/2)\frac{1}{w_3} + (3w_3 + 2 + 3w_4 + [w_3 + 2w_3]/2)\frac{2}{w_3}}{3} + \frac{(3w_3 + 2 + 3w_4 + [2w_3 + w_3]/2)\frac{2}{w_3}}{3}\right]/2$ | $[2 + 1]/2$ | $\left[\frac{2+3w_4}{3w_3} + \frac{2+3w_4}{\frac{3}{2}w_3}\right]/2$ | $\left[\frac{7}{2}+3\right]/2$ |

Fig. 7. Schematic of the C<sup>2</sup>MOS FF [7].

As concerns the first delay, the Elmore model is applied to estimate the delay up to node X, while, as concerns the second delay, the capacitance at node X is assumed as already charged or discharged through M5-M6.
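As a sanity check on how the LE parameters are extracted from the Elmore expressions, the following Python sketch transcribes the four rows of Table II and verifies, with exact rational arithmetic, that each normalized delay satisfies d = g·h + p. The function name and the test widths are arbitrary choices made for illustration.

```python
from fractions import Fraction as Fr


def table2_rows(w1, w2, w3, wL):
    """Rows of Table II as {label: (d, g, h, p)} for given normalized widths."""
    return {
        "1-1": (((5 * w1) / w1 + (2 * w1 + 2 + 3 * w2) * 2 / w1) / 3,
                Fr(2), (2 + 3 * w2) / (3 * w1), Fr(3)),
        "1-2": (((5 * w2 + 2) / w2 + (2 * w2 + 2 + 3 * w3) / w2) / 3,
                Fr(1), (4 + 3 * w3) / (3 * w2), Fr(7, 3)),
        "2-1": ((2 * w2 + 2 + 3 * w3) * 2 / w2 / 3,
                Fr(2, 3), (2 + 3 * w3) / w2, Fr(4, 3)),
        "2-2": ((3 * w3 + 2 + wL) / w3 / 3,
                Fr(1), (2 + wL) / (3 * w3), Fr(1)),
    }


# Arbitrary widths, chosen only to exercise the identities.
rows = table2_rows(Fr(2), Fr(5), Fr(3), Fr(20))
consistent = {label: d == g * h + p for label, (d, g, h, p) in rows.items()}
```

Every entry of `consistent` evaluates to `True` for these widths, which is expected since the identities hold algebraically for any positive widths.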

The application of condition (2) to both paths leads to

$$g_{1-1}h_{1-1} = g_{1-2}h_{1-2} = \sqrt{F_1} = \sqrt{G_1B_1H_1} \tag{5}$$

$$g_{2-1}h_{2-1} = g_{2-2}h_{2-2} = \sqrt{F_2} = \sqrt{G_2B_2H_2} \tag{6}$$

$$G_1 = g_{1-1}g_{1-2}, \quad G_2 = g_{2-1}g_{2-2} \tag{7}$$

$$B_1H_1 = h_{1-1}h_{1-2}, \quad B_2H_2 = h_{2-1}h_{2-2} \tag{8}$$

where  $g_{i-j}(h_{i-j})$  is the logical (electrical) effort of the jth stage in the *i*th path.

According to Table II, (5)–(6) are solved by setting

$$w_2 = \frac{-6 + \sqrt{36 + 18(4 + 3w_3)(3w_1)}}{18} \tag{9}$$

$$w_3 = \frac{-6 + \sqrt{36 + 54(2 + w_L)(w_2)}}{18}. \tag{10}$$

Equations (9)–(10) have to be satisfied to simultaneously minimize $t_{\rm setup}$ and $\tau_{\rm CK-Q,min}$ according to (4). In practice, by substituting (10) into (9) (or vice versa), a single-variable equation results, from which $w_2$ and $w_3$ can be easily identified ($w_1$ and $w_L$ are given, and a simple iterative cycle is sufficient to solve the resulting nonlinear equation).
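One simple iterative cycle of the kind mentioned above is a fixed-point iteration that alternates between (9) and (10). The Python sketch below is a possible realization under that assumption; the function names, the initial guess, and the example values of $w_1$ and $w_L$ are hypothetical.

```python
from math import sqrt


def w2_from_eq9(w1, w3):
    """Width of blocks A and B from (9)."""
    return (-6 + sqrt(36 + 18 * (4 + 3 * w3) * (3 * w1))) / 18


def w3_from_eq10(w2, wL):
    """Width of the output inverter from (10)."""
    return (-6 + sqrt(36 + 54 * (2 + wL) * w2)) / 18


def solve_sizes(w1, wL, tol=1e-12, max_iter=1000):
    """Alternate (9) and (10) until w2 and w3 stop changing."""
    w2, w3 = w1, w1  # arbitrary starting point
    for _ in range(max_iter):
        w2_new = w2_from_eq9(w1, w3)
        w3_new = w3_from_eq10(w2_new, wL)
        if abs(w2_new - w2) < tol and abs(w3_new - w3) < tol:
            return w2_new, w3_new
        w2, w3 = w2_new, w3_new
    raise RuntimeError("fixed-point iteration did not converge")


w1_ex, wL_ex = 2.0, 20.0
w2_ex, w3_ex = solve_sizes(w1_ex, wL_ex)

# Stage efforts of Table II at the solution: each pair should match.
f11 = 2 * (2 + 3 * w2_ex) / (3 * w1_ex)
f12 = (4 + 3 * w3_ex) / (3 * w2_ex)
f21 = (2 / 3) * (2 + 3 * w3_ex) / w2_ex
f22 = (2 + wL_ex) / (3 * w3_ex)
```

Each update grows roughly like the square root of its argument, so the alternation contracts quickly; the last four lines recompute the stage efforts of Table II so that conditions (5) and (6) can be checked at the solution.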


| Path-Stage | Normalized (in LE fashion)<br>Elmore delay d                                                | Logical<br>effort g | Electrical<br>effort <b>h</b>   | Parasitic<br>delay p |
|------------|---------------------------------------------------------------------------------------------|---------------------|---------------------------------|----------------------|
| 1-1        | $\frac{([2w_1 + w_1]/2)\frac{1}{w_1} + (3w_1 + 2 + 3w_2 + [w_1 + 2w_1]/2)\frac{2}{w_1}}{3}$ | 2                   | $\frac{2+3w_2}{3w_1}$           | $\frac{7}{2}$        |
| 1-2        | $\frac{(3w_2 + 2 + 3w_3)\frac{1}{w_2}}{3}$                                                  | 1                   | $\frac{2+3w_3}{3w_2}$           | 1                    |
| 1-3        | $\frac{([2w_3 + w_3]/2)\frac{1}{w_3} + (3w_3 + 2 + 3w_4 + [w_3 + 2w_3]/2)\frac{1}{w_3}}{3}$ | 1                   | $\frac{2+3w_4}{3w_3}$           | 2                    |
| 2-1        | $\frac{(3w_3 + 2 + 3w_4 + [2w_3 + w_3]/2)\frac{2}{w_3}}{3}$                                 | 1                   | $\frac{2+3w_4}{\frac{3}{2}w_3}$ | 3                    |
| 2-2        | $\frac{(3w_4 + 2 + w_L)\frac{1}{w_4}}{3}$                                                   | 1                   | $\frac{2+w_L}{3w_4}$            | 1                    |

TABLE V LE Parameters for the  ${\rm C^2MOS}$  FF Considered as the Union of Two Paths With N=3 and N=2 Stages

#### B. Write-Port Master-Slave FF

The Write-Port MS (WPMS) FF [12], [13] is shown in Fig. 6. It is similar to the FF analyzed in the previous section but replaces TGs with PTs to reduce the clock load, employs partially nongated keepers, and introduces additional logic to speed up the operation of the keepers, which must recover the threshold loss due to the PTs.

In Table III, we report the LE parameters according to the proposed approach. The resistance of the PTs M3 and M6 in Fig. 6 is taken as 1/w when transferring a logic "0" and as 2/w when transferring a logic "1" [6]. These two values are averaged, giving a resistance of 3/(2w) for PTs M3 and M6. PMOS transistors M2, M5, and M8 have widths  $2w_1, 2w_2$ , and  $2w_3$ , respectively.

The parameters of the traditional N=3 procedure are reported only for the intermediate stage 2 (first row of Table IV) given that those of the stages 1 and 3 are equal to those of stages 1-1 and 2-2 in the suggested N=2 procedure.

By applying (5)–(8), for the WPMS FF one finds

$$w_{2} = \frac{-\frac{66}{5} + \sqrt{\left(\frac{66}{5}\right)^{2} + 36(5 + 3w_{3})(3w_{1})\frac{2}{5}}}{18} \tag{11}$$

$$w_3 = \frac{-12 + \sqrt{144 + 36(w_L)(w_2)\frac{6}{5}}}{18}. \tag{12}$$

Equations (11) and (12) can be combined and solved in the same way as (9)–(10).

#### C. $C^2MOS$ FF

The  $C^2MOS$  MS FF [7] is shown in Fig. 7. It replaces full TGs with clocked gating transistors. Formally, the  $C^2MOS$ is not a pure TG- (or PT-) based FF but, as shown elsewhere [1], [20], the gated inverter is derived from an inverter plus a TG; hence, the  $C^2MOS$  can be regarded as belonging to the class of TG- (or PT-) enabled MS FFs. Moreover, including it in our analysis shows that our approach can be extended to FFs that are not strictly TG- (or PT-) based but maintain the same topological features.

In Table V, we report the LE parameters according to the proposed procedure. Note that in the  $C^2MOS$  (see Fig. 7) there are two different nodes identifiable with the node X in Fig. 3. The two paths are made up of N=3 and N=2 stages, respectively. PMOS transistors M2-M4, M6, M8-M10, and M12 have widths  $2w_1, 2w_2, 2w_3$ , and  $2w_4$ , respectively.

Fig. 8. First TGMS FF. (a) Delay. (b) Energy.

Again, the parameters of the traditional N=4 procedure are reported only for the intermediate stage 3 (second row of Table IV) given that those of the stages 1, 2 and 4 are equal to those of stages 1-1, 1-2, and 2-2 in the suggested N=3 procedure.





Fig. 9. First TGMS FF. (a) Relative percentage delay and (b) energy differences between the traditional and proposed approaches.

For the  $\mathrm{C}^2\mathrm{MOS}$  FF, given that path 1 has N=3 stages and given the nonlinearities introduced by branching effects, it is not possible to derive closed-form expressions that reduce to a single-variable equation as for (9)–(10) and (11)–(12). Nevertheless, the resulting system of nonlinear equations is easily solved iteratively, yielding the  $w_2, w_3$ , and  $w_4$  values that satisfy conditions analogous to (5)–(8) for the  $\mathrm{C}^2\mathrm{MOS}$  FF.
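One way to picture such an iterative solution is a Gauss-Seidel style sweep over the three unknowns. The sketch below assumes that the conditions analogous to (5)–(8) amount to equal stage efforts $g \cdot h$ within each path, using only the $g$ and $h$ entries of Table V; any further branching corrections are not modeled. The function names, starting guess, and tolerance are illustrative choices, not taken from the paper.

```python
from math import sqrt


def stage_efforts(w1, w2, w3, w4, wL):
    """Stage efforts g*h for stages 1-1, 1-2, 1-3, 2-1, 2-2 (g, h from Table V)."""
    return (
        2 * (2 + 3 * w2) / (3 * w1),  # stage 1-1
        (2 + 3 * w3) / (3 * w2),      # stage 1-2
        (2 + 3 * w4) / (3 * w3),      # stage 1-3
        (2 + 3 * w4) / (1.5 * w3),    # stage 2-1
        (2 + wL) / (3 * w4),          # stage 2-2
    )


def size_c2mos(w1, wL, tol=1e-10, max_iter=500):
    """Sweep w2, w3, w4 until the assumed equal-effort conditions are met."""
    w2 = w3 = w4 = 1.0  # arbitrary positive starting guess (assumption)
    for _ in range(max_iter):
        old = (w2, w3, w4)
        # effort(1-1) = effort(1-2)  ->  6*w2^2 + 4*w2 - w1*(2 + 3*w3) = 0
        w2 = (-4 + sqrt(16 + 24 * w1 * (2 + 3 * w3))) / 12
        # effort(1-2) = effort(1-3)  ->  3*w3^2 + 2*w3 - w2*(2 + 3*w4) = 0
        w3 = (-2 + sqrt(4 + 12 * w2 * (2 + 3 * w4))) / 6
        # effort(2-1) = effort(2-2)  ->  9*w4^2 + 6*w4 - 1.5*w3*(2 + wL) = 0
        w4 = (-6 + sqrt(36 + 54 * w3 * (2 + wL))) / 18
        if max(abs(a - b) for a, b in zip(old, (w2, w3, w4))) < tol:
            return w2, w3, w4
    raise RuntimeError("iteration did not converge")


if __name__ == "__main__":
    print(size_c2mos(w1=7, wL=39))
```

Each update is the positive root of a quadratic in one unknown with the other two held fixed, so the sweep needs no derivative information; a general-purpose nonlinear solver could be substituted if more elaborate conditions are used.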

#### V. SIMULATION AND COMPARISON

#### A. Comparison With the Traditional Sizing Strategy

Starting from the LE parameters in Tables I–V, the considered FFs are sized according to the traditional and suggested approaches, and the actual energy and delay are extracted by means of simulations. Various loading and input capacitance conditions are explored: in particular,  $w_1$  is varied in the range [5–35], whereas  $w_L$  is varied in the range  $3 \times [5–35]$  (i.e., a load equal to [5–35] minimum symmetrical inverters with  $W_p = 2W_N = 2W_{\min}$ ). For practical reasons, only the realistic cases where  $w_L > w_1$  are considered. The simulation setup used to estimate energy and delay is analogous to that adopted in [16] and is described in detail in the Appendix.



Fig. 10. Relative percentage delay and energy differences between the traditional and proposed approaches: WPMS (a-b),  $\rm C^2MOS$  (c-d).

The average delay (normalized to FO4 = 18.27 ps) and the energy dissipation under a 0.25 data input switching activity (normalized to  $E_{\rm min}=0.202~{\rm fJ}$<sup>6</sup>) are shown in Fig. 8(a)–(b), respectively, for the TGMS FF in Section IV-A, optimized according to the proposed procedure. The relative differences in delay and energy between the proposed sizing strategy and the traditional one are shown in Fig. 9(a)–(b), respectively.<sup>7</sup>

| First<br>TGMS | Traditional<br>procedure |       | Proposed<br>procedure |       | Optimization<br>algorithm |       | Relative % error<br>(proposed proc.) |       | Relative % error<br>(traditional proc.) |       |
|---------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| $w_L$ – $w_1$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ |
| 3 – 1         | 1.3   | 1.1   | 1.2   | 1.0   | 1     | 1     | 20.0  | 0.0   | 30.0  | 10.0  |
| 3 – 7         | 4.9   | 2.6   | 4.3   | 1.7   | 3     | 2     | 43.3  | -15.0 | 63.3  | 30.0  |
| 3 – 13        | 7.5   | 3.4   | 6.4   | 2.2   | 6     | 2     | 6.7   | 10.0  | 25.0  | 70.0  |
| 3 – 19        | 9.6   | 3.9   | 7.8   | 2.4   | 7     | 2     | 11.4  | 20.0  | 37.1  | 95.0  |
| 21 – 1        | 2.1   | 4.5   | 1.4   | 1.6   | 1     | 2     | 40.0  | -20.0 | 110.0 | 125.0 |
| 21 – 7        | 8.4   | 9.6   | 5.6   | 5.0   | 5     | 5     | 12.0  | 0.0   | 68.0  | 92.0  |
| 21 – 13       | 12.8  | 12.0  | 8.5   | 6.1   | 7     | 6     | 21.4  | 1.7   | 82.9  | 100.0 |
| 21 – 19       | 16.6  | 13.7  | 11.3  | 7.0   | 10    | 8     | 13.0  | -12.5 | 66.0  | 71.3  |
| 39 – 1        | 2.6   | 7.1   | 1.9   | 3.5   | 2     | 4     | -5.0  | -12.5 | 30.0  | 77.5  |
| 39 – 7        | 10.2  | 14.6  | 7.5   | 7.3   | 6     | 6     | 25.0  | 21.7  | 70.0  | 143.3 |
| 39 – 13       | 15.6  | 18.2  | 11.3  | 9.1   | 10    | 9     | 13.0  | 1.1   | 56.0  | 102.2 |
| 39 – 19       | 20.2  | 20.8  | 14.6  | 10.4  | 13    | 9     | 12.3  | 15.6  | 55.4  | 131.1 |
| 57 – 1        | 2.9   | 9.2   | 2.1   | 4.6   | 2     | 5     | 5.0   | -8.0  | 45.0  | 84.0  |
| 57 – 7        | 11.6  | 19.0  | 8.3   | 9.4   | 7     | 8     | 18.6  | 17.5  | 65.7  | 137.5 |
| 57 – 13       | 17.6  | 23.5  | 12.7  | 11.7  | 11    | 10    | 15.5  | 17.0  | 60.0  | 135.0 |
| 57 – 19       | 22.8  | 26.8  | 15.3  | 13.2  | 14    | 12    | 9.3   | 10.0  | 62.9  | 123.3 |

TABLE VI Error Between min.  $ED^4$  Sizings Extracted With Traditional/Proposed Procedures and an Optimization Algorithm

As the figures show, the suggested procedure always outperforms the traditional one in terms of speed, with improvements ranging from 1% to 23% and increasing with  $w_L$ . Even more interestingly, the energy dissipation (and area) of the suggested approach is significantly lower than that of the traditional one, with reductions ranging from 6% to 57% (again increasing for larger  $w_L$ ).

Indeed, as concerns the sizing, the  $w_2$  and  $w_3$  values found with the proposed methodology are always lower than the ones found with the traditional approach (by roughly 30% and 50%, respectively). For instance, when considering the case with load equal to 16 minimum symmetrical inverters and  $w_1 = 4$, the traditional approach leads to  $[w_2, w_3] = [7.5, 13.8]$, while the proposed one leads to  $[w_2, w_3] = [5.4, 6.8]$. Despite the smaller sizing, the proposed approach achieves 6% better delay and 40% better energy.

The above results imply that, when optimizing  $\tau_{D-Q,min}$ , the assumption of two split paths is more consistent than that of a single path, which unnecessarily overestimates the actual path effort in the case of Master-Slave topologies.

For the sake of brevity, for the other two considered TGMS FFs (WPMS and  $C^2MOS$ ) we report only the relative percentage delay and energy differences in Fig. 10(a)–(d).

By inspection of the results in the figures, it is apparent that the same considerations made for the first considered FF still hold.

In particular, by combining the above results, it is apparent that the energy efficiency of the suggested sizing strategy is significantly improved. Therefore, the traditional sizing strategy, which assumes a TGMS FF as a whole path, does not actually correspond to the best solution in terms of a high-speed optimization that also accounts for energy consumption.

 $^6$It is the energy dissipated by an unloaded minimum symmetrical inverter for a complete 1  $\to$  0  $\to$  1 output transition.

 $^{7}$The relative differences are obtained as  $(P_A-P_B)/[(P_A+P_B)/2]$ , where  $P_A$  is the parameter (delay or energy) relative to the sizing strategy in Section III-C and  $P_B$  is the parameter relative to the sizing strategy in Section III-D.
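The relative difference of footnote 7, which underlies the percentages plotted in Figs. 9 and 10, can be written as a one-line helper. The function name and the sample numbers below are hypothetical and serve only to illustrate the formula.

```python
def relative_difference_percent(p_a, p_b):
    """Footnote 7: (P_A - P_B) / [(P_A + P_B) / 2], expressed in percent.

    p_a: delay or energy with the sizing of Section III-C.
    p_b: the same quantity with the sizing of Section III-D.
    """
    return 100.0 * (p_a - p_b) / ((p_a + p_b) / 2.0)


if __name__ == "__main__":
    # Hypothetical values, not taken from the paper's simulations.
    print(relative_difference_percent(110.0, 90.0))
```

Normalizing by the mean of the two values makes the measure antisymmetric: swapping the two strategies only flips the sign, so neither one is privileged as the reference.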

#### B. Comparison With Results From an Optimization Algorithm

Given the above results, the suggested approach can serve as a basis (or a starting point) for the optimization of this class of FFs in the high-speed region of the energy-delay space, that is, the region where products  $E^iD^j$  with  $j$  significantly larger than  $i$  are minimized.

To verify this statement, the sizings obtained with the proposed approach are compared with those resulting from a simulation-driven optimization algorithm (see [4], [16] for a detailed description) under the constraint of minimum  $\mathrm{ED}^4$  energy-delay product. The minimization of such a figure of merit exemplifies a design strategy that primarily targets speed [4], [15]–[17]. The optimizations are carried out by combining the ranges [1–19] and  $3 \times [1–19]$  for  $w_1$  and  $w_L$ , respectively (some rows, relative to nonpractical  $w_L \ll w_1$  cases, are highlighted in gray).

The  $w_2$  and  $w_3$  (and  $w_4$  for the  $C^2MOS$  FF) values obtained through the traditional procedure, through the proposed one in Section III-D, and through the simulation-driven optimization algorithm are reported in Tables VI, VII, and VIII for the first TGMS FF, the WPMS, and the  $C^2MOS$ , respectively. By inspection of the results, the relative percentage error of the proposed procedure in the transistor sizes is moderate for all the considered FFs, typically within 20% except for a few cases (due to the very small  $w_i$  values). On the contrary, it is apparent that the traditional approach leads to unnecessarily strong oversizing.
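The error entries in Tables VI–VIII are consistent with a signed deviation of each size from the optimizer's value, normalized to the latter. The sketch below checks this reading on the $w_L = 3$, $w_1 = 7$ row of Table VI; the helper name is an assumption, and the formula is inferred from the tabulated numbers rather than stated in the text.

```python
def relative_error_percent(w_procedure, w_optimizer):
    """Signed deviation of a sizing from the optimizer's result, in percent."""
    return 100.0 * (w_procedure - w_optimizer) / w_optimizer


if __name__ == "__main__":
    # Table VI, row w_L = 3, w_1 = 7: sizes as (w2, w3)
    optimizer = (3, 2)
    proposed = (4.3, 1.7)
    traditional = (4.9, 2.6)
    for name, sizes in (("proposed", proposed), ("traditional", traditional)):
        errors = [round(relative_error_percent(w, w_opt), 1)
                  for w, w_opt in zip(sizes, optimizer)]
        print(name, errors)
```

With this reading, the 43.3% and -15.0% entries of that row follow from the proposed sizes (4.3, 1.7), and the 63.3% and 30.0% entries from the traditional sizes (4.9, 2.6).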

To further exemplify the energy-delay space region where it is worth using such an approach, in Table IX, we report the sizing, energy and delay for the proposed approach, minimum  $ED^4$  and minimum ED designs, in the case of the first considered TGMS FF, with 13 loading inverters and  $w_1 = [1, 7, 13, 19]$ .

It is apparent that the designs arising with the proposed methodology are close to the energy-efficient one in the

TABLE VII Error Between min.  $ED^4$  Sizings Extracted With Traditional/Proposed Procedures and an Optimization Algorithm

| WPMS<br>FF | Traditional<br>procedure |       | Proposed<br>procedure |       | Optimization<br>algorithm |       | Relative % error<br>(proposed proc.) |       | Relative % error<br>(traditional proc.) |       |  |
|-------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|--|
| $w_L$ – $w_1$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ | $w_2$ | $w_3$ |  |
| 3 -1        | 1.0                   | 1.0            | 1.0                   | 1.0   | 1              | 1              | 0.0   | 0.0                       | 0.0                               | 0.0            |  |
| 3 -7        | 4.2                   | 1.5            | 2.3                   | 1.0   | 3              | 1              | -23.3 | 0.0                       | 40.0                              | 50.0           |  |
| 3 - 13      | 7.8                   | 2.1            | 5.3                   | 1.4   | 6              | 1              | -11.7 | 40.0                      | 30.0                              | 110.0          |  |
| 3 - 19      | 10.3                  | 3.1            | 8.2                   | 1.8   | 8              | 2              | 2.5   | -10.0                     | 28.8                              | 55.0           |  |
| 21 – 1      | 1.9                   | 3.0            | 1.4                   | 2.0   | 1              | 2              | 40.0  | 0.0                       | 90.0                              | 50.0           |  |
| 21 – 7      | 6.1                   | 4.8            | 4.2                   | 3.6   | 4              | 3              | 5.0   | 20.0                      | 52.5                              | 60.0           |  |
| 21 – 13     | 10.5                  | 7.4            | 8.9                   | 5.4   | 8              | 4              | 11.3  | 35.0                      | 31.3                              | 85.0           |  |
| 21 – 19     | 14.8                  | 10.1           | 12.2                  | 6.9   | 12             | 6              | 1.7   | 15.0                      | 23.3                              | 68.3           |  |
| 39 – 1      | 2.3                   | 4.2            | 2.1                   | 3.4   | 2              | 3              | 5.0   | 13.3                      | 15.0                              | 40.0           |  |
| 39 – 7      | 10.2                  | 11.4           | 8.8                   | 8.1   | 8              | 7              | 10.0  | 15.7                      | 27.5                              | 62.9           |  |
| 39 – 13     | 14.7                  | 14.9           | 12.8                  | 10.3  | 11             | 9              | 16.4  | 14.4                      | 33.6                              | 65.6           |  |
| 39 – 19     | 17.7                  | 16.0           | 15.4                  | 12.0  | 15             | 11             | 2.7   | 9.1                       | 18.0                              | 45.5           |  |
| 57 – 1      | 3.5                   | 6.0            | 2.8                   | 4.1   | 3              | 4              | -6.7  | 2.5                       | 16.7                              | 50.0           |  |
| 57 – 7      | 12.1                  | 13.2           | 9.2                   | 9.4   | 9              | 8              | 2.2   | 17.5                      | 34.4                              | 65.0           |  |
| 57 – 13     | 15.0                  | 17.1           | 12.4                  | 12.3  | 12             | 11             | 3.3   | 11.8                      | 25.0                              | 55.5           |  |
| 57 – 19     | 19.4                  | 22.5           | 16.3                  | 14.9  | 15             | 14             | 8.7   | 6.4                       | 29.3                              | 60.7           |  |

TABLE VIII Error Between min.  $ED^4$  Sizings Extracted With Traditional/Proposed Procedures and an Optimization Algorithm

| C <sup>2</sup> MOS<br>FF | Traditional<br>procedure |                |       |       | Proposed procedure |       |       | Optimization<br>algorithm |       |       | Relative % error<br>(traditional proc.) |       |       | Relative % error (proposed proc.) |       |  |
|--------------------------|--------------------------|----------------|-------|-------|--------------------|-------|-------|---------------------------|-------|-------|-----------------------------------------|-------|-------|-----------------------------------|-------|--|
| $w_L - w_I$              | $w_2$                    | w <sub>3</sub> | $w_4$ | $w_2$ | w <sub>3</sub>     | $w_4$ | $w_2$ | $w_3$                     | $w_4$ | $w_2$ | $w_3$                                   | $w_4$ | $w_2$ | w <sub>3</sub>                    | $w_4$ |  |
| 3 -1                     | 1                        | 1              | 1     | 1     | 1                  | 1     | 1     | 1                         | 1     | 0,0   | 0,0                                     | 0,0   | 0.0   | 0.0                               | 0.0   |  |
| 3 -7                     | 3                        | 2              | 2     | 3     | 2                  | 1     | 2     | 2                         | 1     | 50,0  | 0,0                                     | 0,0   | 50.0  | 0.0                               | 100.0 |  |
| 3 - 13                   | 6                        | 4              | 3     | 4     | 2                  | 1     | 4     | 3                         | 1     | 0,0   | -33,3                                   | 0,0   | 50.0  | 33.3                              | 200.0 |  |
| 3 - 19                   | 9                        | 6              | 4     | 7     | 4                  | 2     | 6     | 5                         | 2     | 16,7  | -20,0                                   | 0,0   | 50.0  | 20.0                              | 100.0 |  |
| 21 – 1                   | 2                        | 3              | 4     | 1     | 2                  | 3     | 1     | 2                         | 2     | 0,0   | 0,0                                     | 50,0  | 100.0 | 50.0                              | 100.0 |  |
| 21 – 7                   | 7                        | 6              | 8     | 4     | 4                  | 4     | 4     | 4                         | 5     | 0,0   | 0,0                                     | -20,0 | 75.0  | 50.0                              | 60.0  |  |
| 21 – 13                  | 10                       | 8              | 8     | 7     | 6                  | 5     | 6     | 5                         | 5     | 16,7  | 20,0                                    | 0,0   | 66.7  | 60.0                              | 60.0  |  |
| 21 – 19                  | 12                       | 10             | 9     | 9     | 7                  | 5     | 7     | 7                         | 5     | 28,6  | 0,0                                     | 0,0   | 71.4  | 42.9                              | 80.0  |  |
| 39 – 1                   | 1                        | 3              | 5     | 1     | 2                  | 4     | 1     | 2                         | 3     | 0,0   | 0,0                                     | 33,3  | 0.0   | 50.0                              | 66.7  |  |
| 39 – 7                   | 6                        | 6              | 7     | 5     | 6                  | 7     | 5     | 5                         | 6     | 0,0   | 20,0                                    | 16,7  | 20.0  | 20.0                              | 16.7  |  |
| 39 – 13                  | 9                        | 8              | 9     | 7     | 7                  | 7     | 6     | 6                         | 7     | 16,7  | 16,7                                    | 0,0   | 50.0  | 33.3                              | 28.6  |  |
| 39 – 19                  | 12                       | 10             | 10    | 10    | 9                  | 8     | 8     | 8                         | 8     | 25,0  | 12,5                                    | 0,0   | 50.0  | 25.0                              | 25.0  |  |
| 57 – 1                   | 1                        | 3              | 6     | 1     | 2                  | 5     | 1     | 2                         | 4     | 0,0   | 0,0                                     | 25,0  | 0.0   | 50.0                              | 50.0  |  |
| 57 – 7                   | 6                        | 7              | 9     | 5     | 6                  | 8     | 5     | 5                         | 7     | 0,0   | 20,0                                    | 14,3  | 20.0  | 40.0                              | 28.6  |  |
| 57 – 13                  | 10                       | 10             | 12    | 8     | 9                  | 10    | 7     | 8                         | 9     | 14,3  | 12,5                                    | 11,1  | 42.9  | 25.0                              | 33.3  |  |
| 57 – 19                  | 12                       | 12             | 14    | 11    | 11                 | 11    | 9     | 9                         | 10    | 22,2  | 22,2                                    | 10,0  | 33.3  | 33.3                              | 40.0  |  |

TABLE IX Sizing, Energy and Delay for the Proposed Procedure, min.  ${\rm ED^4}$  and min. ED Sizings in Some Reference Cases

| First<br>TGMS | Proposed procedure |                |                            |                | Minimum ED<sup>4</sup> sizing |                |                            |                | Minimum ED sizing |                |                            |                |  |
|---------------|----------------|----------------|----------------------------|----------------|-------|----------------|----------------------------|----------------|-------------------|----------------|----------------------------|----------------|--|
| $w_L$ – $w_1$ | $w_2$          | $w_3$          | Energy [E<sub>min</sub>] | Delay<br>[FO4] | $w_2$ | $w_3$          | Energy [E<sub>min</sub>] | Delay<br>[FO4] | $w_2$             | $w_3$          | Energy [E<sub>min</sub>] | Delay<br>[FO4] |  |
| 39 – 1        | 2.1            | 3.4            | 48.05                      | 4.09           | 2     | 3              | 47.31                      | 4.08           | 1                 | 2              | 22.8                       | 5.0            |  |
| 39 – 7        | 8.8            | 8.1            | 156.07                     | 2.87           | 8     | 7              | 148.12                     | 2.90           | 3                 | 3              | 55.6                       | 4.0            |  |
| 39 – 13       | 12.8           | 10.3           | 241.45                     | 2.67           | 11    | 9              | 222.48                     | 2.71           | 4                 | 5              | 104.9                      | 3.4            |  |
| 39 – 19       | 15.4           | 12.0           | 321.00                     | 2.60           | 15    | 11             | 314.29                     | 2.64           | 6                 | 6              | 160.7                      | 3.2            |  |

high-speed region of the E-D space; i.e., the results further demonstrate that the proposed procedure closely approaches the energy-efficient sizings that minimize the figures of merit where speed is the primary concern.
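To make the comparison in Table IX concrete, the figures of merit $E^iD^j$ can be recomputed from the tabulated energy and delay values. The sketch below does so for the $w_L = 39$, $w_1 = 13$ row; the dictionary keys and the helper name are illustrative, and the numbers are the normalized values (energy in $E_{\rm min}$, delay in FO4) copied from that row.

```python
def figure_of_merit(energy, delay, i=1, j=4):
    """Energy-delay product E^i * D^j (defaults give ED^4)."""
    return energy ** i * delay ** j


# Table IX, row w_L = 39, w_1 = 13: (energy [E_min], delay [FO4])
designs = {
    "proposed": (241.45, 2.67),
    "min_ED4": (222.48, 2.71),
    "min_ED": (104.9, 3.4),
}

ed4 = {name: figure_of_merit(e, d, 1, 4) for name, (e, d) in designs.items()}
ed = {name: figure_of_merit(e, d, 1, 1) for name, (e, d) in designs.items()}

if __name__ == "__main__":
    for name in designs:
        print(f"{name:9s} ED^4 = {ed4[name]:9.1f}   ED = {ed[name]:7.1f}")
    ratio = ed4["proposed"] / ed4["min_ED4"]
    print(f"proposed ED^4 relative to the minimum-ED^4 design: {ratio:.3f}")
```

For this row, the proposed sizing lands within a few percent of the minimum-$ED^4$ design in terms of $ED^4$, while the minimum-ED design trades a clearly larger delay for lower energy.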

In general, minimum-delay designs represent a bound for the design space, can be used as a starting point for the optimized search within it, and reveal many useful properties through their analysis [4]. For this reason, a proper revision of the traditional LE method is necessary when optimizing circuits employing TGMS FFs.

#### VI. CONCLUSION

In this work, design approaches to deal with the high-speed sizing of TGMS FFs have been reconsidered. Delay minimization is pursued by splitting the whole structure into two different paths, carrying out a separate LE optimization on each, and then reconciling the results. This methodology has been applied to three well-known TGMS FFs in a 65-nm CMOS technology and compared with the results obtained by traditionally considering the FFs as whole uninterrupted paths.

The proposed procedure has been shown to achieve better delay and energy, i.e., it steers the design towards a proper handling of the actual path effort of such structures and, hence, significantly improves the performance when designing for high speed. Indeed, the comparison with results from a simulation-driven optimization algorithm has shown that the proposed methodology closely approaches the energy-efficient sizings in the high-speed energy-delay region.

As a final comment, technology scaling should not affect the effectiveness of the proposed methodology, as long as the LE modeling approach remains valid. Despite the increasing importance of several effects arising in nanometer technologies, it is always possible to extract sufficiently accurate LE parameters g and p, if necessary by characterizing them through simulations rather than simple hand calculations. Once such parameters are accurately extracted for the basic building blocks (e.g., inverters, gated inverters, inverters plus TGs) in a particular technology, the effectiveness of the method should be unchanged with technology scaling.

#### APPENDIX

SIMULATION SETUP AND EVALUATION OF FFS ENERGY

The reference test bench circuit for simulations is analogous to that adopted in [1]–[5], [15], [16].

Depending on the load value, a further "load on the load" (four times larger than the load itself) is used to avoid an excessive amplification of the  $C_{\rm GD}$  load capacitances.

A pair of series inverters is employed to feed both the clock and data inputs to the FF. The first one realistically shapes the input waveform, whereas the second is sized to achieve the desired slope on the input signals.

A constant FO3 slope policy is adopted with regard to the clock signal fed to the FF (including the parasitic capacitances) [1]–[3], while the slope of the data input is always equal to the estimated transition time (through LE calculations) at the output of the data-driven FF first stage [5], [16]. Indeed, in realistic pipelines, the circuit driving the FF has a speed similar to the FF itself.

The average energy per clock cycle is due to the dynamic and static contributions, which are separately estimated. The first one is evaluated by referring to all the possible combinations between data and clock transitions  $(1 \to 0 \text{ and } 0 \to 1)$  and output states (0 and 1). Twelve elementary energy contributions result, which are weighted according to the switching activity of the data input,  $\alpha_{sw}$, in order to find the average dynamic energy in one clock cycle [16]. These contributions are evaluated by integrating the supply current between the beginning of the transitions and the time when the slowest among the FF nodes reaches its steady-state value (the energy needed to charge the clock and data input capacitance is also included [1]–[3], whereas the average energy spent on the load is subtracted from the whole computation).

The static energy contribution is due to leakage currents. It is evaluated by averaging the eight possible static currents according to the data, clock and output steady states. In particular, we considered a  $T_{\rm CK}/FO4=40$  logic depth, which is typical in practical energy-efficient microprocessors.
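As a rough sketch of how this static contribution could be turned into an energy per clock cycle, the clock period follows from the stated logic depth and the FO4 delay quoted in Section V, and the leakage energy is the supply voltage times the averaged leakage current times that period. The supply voltage $V_{DD}$ and the individual leakage currents $I_k$ are symbols introduced here for illustration; their values are not given in the text.

$$T_{\rm CK} = 40 \cdot {\rm FO4} = 40 \times 18.27~{\rm ps} \approx 730.8~{\rm ps}$$

$$E_{\rm static} \approx V_{DD}\,\bar{I}_{\rm leak}\,T_{\rm CK}, \qquad \bar{I}_{\rm leak} = \frac{1}{8}\sum_{k=1}^{8} I_k$$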

#### REFERENCES

- [1] V. Oklobdzija, V. Stojanovic, D. Markovic, and N. Nedovic, *Digital System Clocking: High-Performance and Low-Power Aspects.* New York: Wiley-Interscience, 2003.
- [2] V. Oklobdzija, "Clocking and clocked storage elements in a multi-GigaHertz environment," *IBM J. Res. Devel.*, vol. 47, no. 5/6, pp. 567–583, Sept./Nov. 2003.
- [3] V. Stojanovic and V. Oklobdzija, "Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems," *IEEE J. Solid-State Circuits*, vol. 34, no. 4, pp. 536–548, Apr. 1999.
- [4] M. Alioto, E. Consoli, and G. Palumbo, "General strategies to design nanometer flip-flops in the energy-delay space," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 7, pp. 1583–1596, Jul. 2010.
- [5] M. Alioto, E. Consoli, and G. Palumbo, "Flip-flop energy/performance versus clock slope and impact on the clock network design," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 6, pp. 1273–1286, Jun. 2010.
- [6] I. Sutherland, B. Sproull, and D. Harris, *Logical Effort: Designing Fast CMOS Circuits.* San Mateo, CA: Morgan Kaufmann, 1998.
- [7] Y. Suzuki, K. Odagawa, and T. Abe, "Clocked CMOS calculator circuitry," *IEEE J. Solid-State Circuits*, vol. SSC-8, no. 6, pp. 462–469, Dec. 1973.
- [8] G. Gerosa et al., "A 2.2 W, 80 MHz superscalar RISC microprocessor," *IEEE J. Solid-State Circuits*, vol. 29, no. 12, pp. 1440–1452, Dec. 1994.
- [9] R. Llopis and M. Sachdev, "Low power, testable dual edge triggered flip-flops," in *Proc. ISLPED*, Aug. 1996, pp. 341–345.
- [10] U. Ko and P. Balsara, "High-performance energy-efficient D-flip-flop circuits," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 8, no. 1, pp. 94–98, Feb. 2000.
- [11] D. Markovic, B. Nikolic, and R. Brodersen, "Analysis and design of low-energy flip-flops," in *Proc. ISLPED*, Aug. 2001, pp. 52–55.
- [12] D. Markovic, J. Tschanz, and V. De, "Transmission-Gate Based Flip-Flop," U.S. Patent 6 642 765, Nov. 4, 2003.
- [13] S. Hsu, S. Mathew, M. Anders, B. Bloechel, R. Krishnamurthy, and S. Borkar, "A 110 GOPS/W 16b multiplier and reconfigurable PLA loop in 90 nm CMOS," in *Proc. ISSCC*, Feb. 2005, vol. 1, pp. 376–377.
- [14] H. Partovi, "Clocked Storage Elements," in *Design of High-Performance Microprocessor Circuits*. New York: IEEE Press, 2001, pp. 207–234.
- [15] C. Giacomotto, N. Nedovic, and V. Oklobdzija, "The effect of the system specification on the optimal selection of clocked storage elements," *IEEE J. Solid-State Circuits*, vol. 42, no. 6, pp. 1392–1404, Jun. 2007.
- [16] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part I—Methodologies and design strategies," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.* [Online]. Available: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5419974
- [17] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and comparison in the energy-delay-area domain of nanometer CMOS flip-flops: Part II—Results and figures of merit," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, DOI: 10.1109/TVLSI.2010.2041377.
- [18] G. Palumbo and M. Pennisi, "Design guidelines for high-speed transmission-gate latches: Analysis and comparison," in *Proc. IEEE ICECS*, Aug.–Sep. 2008, pp. 145–148.
- [19] A. Morgenshtein, E. Friedman, R. Ginosar, and A. Kolodny, "Unified logical effort—A method for delay evaluation and minimization in logic paths with RC interconnect," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 18, no. 5, pp. 689–696, May 2010.
- [20] N. Weste and D. Harris, *CMOS VLSI Design: A Circuits and Systems Perspective*, 3rd ed. Reading, MA: Addison-Wesley, 2004.



**Elio Consoli** was born in Catania, Italy, in 1983. He received the M.S. degree in microelectronic engineering from the University of Catania, Catania, Italy, in 2008, where he is currently working toward the Ph.D. degree at the Department of Electrical, Electronic, and Systems Engineering.



**Gaetano Palumbo** (F'07) was born in Catania, Italy, in 1964. He received the Laurea degree in electrical engineering and Ph.D. degree from the University of Catania, Catania, Italy, in 1988 and 1993, respectively.

Since 1993, he has conducted courses on Electronic Devices, Electronics for Digital Systems, and Basic Electronics. In 1994, he joined the Dipartimento Elettrico Elettronico e Sistemistico, now the Dipartimento di Ingegneria Elettrica Elettronica e dei Sistemi (DIEES), University of Catania, Catania, Italy, as a Researcher, subsequently becoming an Associate Professor in 1998. Since 2000, he has been a Full Professor with the same department. His primary research interest has been analog circuits with particular emphasis on feedback circuits, compensation techniques, current-mode approach, and low-voltage circuits. His research has also embraced digital circuits with emphasis on bipolar and MOS current-mode digital circuits, adiabatic circuits, and high-performance building blocks focused on achieving optimum speed within the constraint of low power operation. In all of these fields, he is developing research activities in collaboration with STMicroelectronics of Catania. He was the coauthor of the books *CMOS Current Amplifiers* (Kluwer, 1999), *Feedback Amplifiers: Theory and Design* (Kluwer, 2001), and *Model and Design of Bipolar and MOS Current-Mode Logic: CML, ECL and SCL Digital Circuits* (Kluwer, 2005) and a textbook on electronic devices in 2005. He is the author or coauthor of over 350 scientific papers in refereed international journals (150) and in conferences. Moreover, he is the coauthor of several patents.

Dr. Palumbo served as an associate editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS for the topics "Analog Circuits and Filters" and "Digital Circuits and Systems," respectively, from June 1999 to the end of 2001 and from 2004 to 2005. From 2006 to 2007, he served as an associate editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS. Since 2008, he has served as an associate editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS. In 2005, he was one of 12 panelists in the scientific-disciplinary area 09 (industrial and information engineering) of the Committee for Evaluation of Italian Research, which aims at evaluating Italian research in that area for the period 2001–2003. In 2003, he was the recipient of the Darlington Award. Since 2011, he has been a member of the Board of Governors of the IEEE Circuits and Systems Society.



**Melita Pennisi** (M'08) was born in Catania, Italy, in 1980. She received the Laurea degree in electronics engineering and the Ph.D. degree in electronics and automation engineering from the University of Catania, Catania, Italy, in 2004 and 2008, respectively.

Since 2008, she has been a Researcher with the Dipartimento di Ingegneria Elettrica Elettronica e dei Sistemi (DIEES), University of Catania, Catania, Italy. She is the author or coauthor of more than 15 publications in international journals and conference proceedings. Her primary research interests include the modeling and optimized design of high-performance CMOS, analysis of analog nonlinear circuits, behavioral modeling of complex mixed-signal circuits, and design/modeling for variability-tolerant and low-leakage VLSI circuits.