## Index

*Note*: Online information is listed by chapter and section number followed by page numbers (OL3.11-7). Page references preceded by a single letter and a hyphen refer to appendices.

- 1-bit ALU, A-614–617. See also Arithmetic logic unit (ALU)
  - adder, A-615
  - CarryOut, A-616
  - for most significant bit, A-621
  - illustrated, A-617
  - logical unit for AND/OR, A-615
  - performing AND, OR, and addition, A-619, A-621
- 64-bit ALU, A-617–626. See also Arithmetic logic unit (ALU)
  - defining in Verilog, A-623–626
  - from 31 copies of 1-bit ALU, A-622
  - illustrated, A-624
  - ripple carry adder, A-617
  - tailoring to MIPS, A-619–623
  - with 32 1-bit ALUs, A-618

### A

- AArch32, 73
- AArch64, 73
- Absolute references, 131
- Abstractions
  - hardware/software interface, 22
  - principle, 22
  - to simplify design, 11
- Accumulator architectures, OL2.22-2
- Acronyms, 9
- Active matrix, 18
- ADD (add), 64
- ADDI (add immediate), 64
- ADDIS (add immediate and set flags), 64
- Addition, 188–191. See also Arithmetic
  - binary, 188–189
  - floating-point, 212–215, 220
  - operands, 189
  - significands, 211
  - speed, 191
- Address interleaving, 395
- Address select logic, C-24, C-25
- Address space
  - extending, 493
  - flat, 493
  - ID (ASID), 460
  - inadequate, OL5.17-6
  - shared, 533–534
  - single physical, 533–534
  - virtual, 460
- Address translation
  - for ARM Cortex-A8, 483
  - defined, 443
  - fast, 452–454
  - for Intel Core i7, 483
  - TLB for, 452–454
- Address-control lines, C-26
- Addresses
  - base, 69
  - byte, 70
  - defined, 69
  - memory, 79
  - virtual, 442, 462, 463
- Addressing
  - base, 120
  - in branches, 117–120
  - displacement, 120
  - immediate, 120
  - LEGv8 modes, 120–121
  - PC-relative, 118, 120
  - register, 120
  - x86 modes, 158
- Addressing modes
  - desktop architectures, D-6
- ADDS (add and set flags), 64, 164
- addu (Add Unsigned), 64
- Advanced Encryption Standard (AES) encryption, 488
- Advanced Vector Extensions (AVX), 232, 240
- AGP, B-9
- Algol-60, OL2.22-7
- Aliasing, 458, 459
- Alignment restriction, 71
- All-pairs N-body algorithm, B-65
- Alpha architecture
  - bit count instructions, D-29
  - floating-point instructions, D-28
  - instructions, D-27–29
  - no divide, D-28
  - PAL code, D-28
  - unaligned load-store, D-28
  - VAX floating-point formats, D-29
- ALU control, 271–273. See also Arithmetic logic unit (ALU)
  - bits, 272
  - logic, C-6
  - mapping to gates, C-4–7
  - truth tables, C-5
- ALU control block, 275
  - defined, C-4
  - generating ALU control bits, C-6
  - control signal, 275
- Amazon Web Services (AWS), 439
- AMD Opteron X4 (Barcelona), 559, 560
- AMD64, 155, 156, 232, OL2.2-6
- Amdahl's law, 415, 519
  - corollary, 49
  - defined, 49
  - fallacy, 572
- and (AND), 64
- AND gates, A-600, C-7
- AND operation, 91
- AND operation, A-594
- andi (And Immediate), 65
- Annual failure rate (AFR), 432
  - versus MTTF of disks, 433–434
- Antidependence, 348
- Antifuse, A-666
- Apple computer, OL1.12-7
- Apple iPad 2 A1395, 20
  - logic board of, 20
  - processor integrated circuit of, 21
- Application binary interface (ABI), 22
- Application programming interfaces (APIs)
  - defined, B-4
                                                 | ADDI (add immediate), 64                                                                                                                                                                                                          | addu (Add Unsigned), 64                                                                                                                | versus MTTF of disks, 433-434                                                                                                                                                                                                       |
| binary, 188–189 Advanced Vector Extensions (AVX), 232, Apple computer, OL1.12-7  floating-point, 212–215, 220 operands, 189 Significands, 211 Speed, 191 Address interleaving, 395 Address select logic, C-24, C-25 Advanced Vector Extensions (AVX), 232, Apple computer, OL1.12-7 Apple iPad 2 A1395, 20 logic board of, 20 processor integrated circuit of, 21 Application binary interface (ABI), 22 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 Application programming interfaces (APIs) Adefined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                 | ADDIS (add immediate and set flags), 64                                                                                                                                                                                           | Advanced Encryption Standard (AES)                                                                                                     | Antidependence, 348                                                                                                                                                                                                                 |
| binary, 188–189 Advanced Vector Extensions (AVX), 232, Apple computer, OL1.12-7  floating-point, 212–215, 220 operands, 189 Significands, 211 Speed, 191 Address interleaving, 395 Address select logic, C-24, C-25 Advanced Vector Extensions (AVX), 232, Apple computer, OL1.12-7 Apple iPad 2 A1395, 20 logic board of, 20 processor integrated circuit of, 21 Application binary interface (ABI), 22 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 Application programming interfaces (APIs) Adefined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                 | Addition, 188–191. See also Arithmetic                                                                                                                                                                                            | encryption, 488                                                                                                                        | Antifuse, A-666                                                                                                                                                                                                                     |
| operands, 189 AGP, B-9 logic board of, 20 significands, 211 Algol-60, OL2.22-7 processor integrated circuit of, 21 speed, 191 Aliasing, 458, 459 Application binary interface (ABI), 22 Address interleaving, 395 Alignment restriction, 71 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                 | binary, 188-189                                                                                                                                                                                                                   | Advanced Vector Extensions (AVX), 232,                                                                                                 | Apple computer, OL1.12-7                                                                                                                                                                                                            |
| significands, 211 Algol-60, OL2.22-7 processor integrated circuit of, 21 speed, 191 Aliasing, 458, 459 Application binary interface (ABI), 22 Address interleaving, 395 Alignment restriction, 71 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                 | floating-point, 212-215, 220                                                                                                                                                                                                      | 240                                                                                                                                    | Apple iPad 2 A1395, 20                                                                                                                                                                                                              |
| speed, 191 Aliasing, 458, 459 Application binary interface (ABI), 22 Address interleaving, 395 Alignment restriction, 71 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                 | operands, 189                                                                                                                                                                                                                     | AGP, B-9                                                                                                                               | logic board of, 20                                                                                                                                                                                                                  |
| Address interleaving, 395 Alignment restriction, 71 Application programming interfaces (APIs) Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
                                                 | significands, 211                                                                                                                                                                                                                 | Algol-60, OL2.22-7                                                                                                                     | processor integrated circuit of, 21                                                                                                                                                                                                 |
| Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 | speed, 191                                                                                                                                                                                                                        | Aliasing, 458, 459                                                                                                                     | Application binary interface (ABI), 22                                                                                                                                                                                              |
| Address select logic, C-24, C-25 All-pairs N-body algorithm, B-65 defined, B-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 | -                                                                                                                                                                                                                                 | Alignment restriction, 71                                                                                                              | Application programming interfaces (APIs)                                                                                                                                                                                           |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 |                                                                                                                                                                                                                                   | =                                                                                                                                      |                                                                                                                                                                                                                                     |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               
                                                 |                                                                                                                                                                                                                                   |                                                                                                                                        | graphics, B-14                                                                                                                                                                                                                      |

| Architectural registers, 358          | Arrays, 429                                         | Benchmarks, 554–556                                  |
|---------------------------------------|-----------------------------------------------------|------------------------------------------------------|
| Arithmetic, 186–248                   | logic elements, A-606–607                           | defined, 46                                          |
| addition, 188–191                     | multiple dimension, 226                             | Linpack, 554, OL3.12-4                               |
| addition and subtraction, 188–191     | pointers versus, 146–150                            | multicores, 538–545                                  |
| division, 197–204                     | procedures for setting to zero, 147                 | multiprocessor, 554–556                              |
| fallacies and pitfalls, 242–245       | ASCII                                               | NAS parallel, 556                                    |
| floating-point, 205–230               | binary numbers versus, 111                          | parallel, 555                                        |
| historical perspective, 248           | character representation, 110                       | PARSEC suite, 556                                    |
| multiplication, 191–197               | defined, 110                                        | SPEC CPU, 46–48                                      |
| parallelism and, 230–232              | symbols, 113                                        | SPEC power, 48–49                                    |
| Streaming SIMD Extensions and advanced vector extensions in x86, 232–233 | Assemblers, 129–131                  | SPECrate, 554                                        |
|                                       | defined, 14                                         | Stream, 564                                          |
|                                       | function, 129                                       | Biased notation, 82, 209                             |
| subtraction, 188–191                  | microcode, C-30                                     | Big-endian byte order, 70                            |
| subword parallelism, 230–232          | number acceptance, 130                              | Binary numbers, 83                                   |
| subword parallelism and matrix        | object file, 130                                    | ASCII versus, 111                                    |
| multiply, 238–242                     | Assembly language, 15                               | conversion to decimal numbers, 78                    |
| Arithmetic instructions. See also     | defined, 14, 129                                    | defined, 75                                          |
| Instructions                          | floating-point, 221                                 | Bisection bandwidth, 551                             |
| desktop RISC, D-11                    | illustrated, 15                                     | Bit maps                                             |
| embedded RISC, D-14                   | LEGv8, 64, 86                                       | defined, 18, 73                                      |
| logical, 263                          | programs, 129                                       | goal, 18                                             |
| operands, 67–74                       | translating into machine language,                  | storing, 18                                          |
| Arithmetic intensity, 557             | 86                                                  | Bit-Interleaved Parity (RAID 3), OL5.11-             |
| Arithmetic logic unit (ALU). See also | Asserted signals, 262, A-592                        | 5                                                    |
| ALU control; Control units            | Associativity                                       | Bits                                                 |
| 1-bit, A-614–617                      | in caches, 419                                      | ALUOp, 272, 273                                      |
| 64-bit, A-617–626                     | degree, increasing, 418, 466                        | defined, 14                                          |
| before forwarding, 321                | increasing, 423                                     | dirty, 452                                           |
| branch datapath, 266                  | set, tag size versus, 423                           | guard, 228                                           |
| hardware, 190                         | Atomic compare and swap, 127                        | patterns, 228–229                                    |
| memory-reference instruction use, 257 | Atomic exchange, 126                                | reference, 450                                       |
| for register values, 264              | Atomic fetch-and-increment, 127                     | rounding, 228                                        |
| R-format operations, 265              | Atomic memory operation, B-21                       | sign, 77                                             |
| signed-immediate input, 323           | Attribute interpolation, B-43–44                    | state, C-8                                           |
| ARM Cortex-A53, 256, 355–358          | Automobiles, computer application in, 4             | sticky, 228                                          |
| address translation for, 483          | Average memory access time (AMAT),                  | valid, 397                                           |
| caches in, 484                        | 416                                                 | Blocking assignment, A-612                           |
| data cache miss rates for, 485        | calculating, 416                                    | Blocking factor, 428                                 |
| memory hierarchies of, 482            |                                                     | Block-Interleaved Parity (RAID 4),                   |
| performance of, 485–488               | B                                                   | OL5.11-5-5.11-6                                      |
| specification, 356                    |                                                     | Blocks                                               |
| TLB hardware for, 483                 | Bandwidth, 30                                       | combinational, A-592                                 |
| ARM instructions, 152–154             | bisection, 551                                      | defined, 390                                         |
| 12-bit immediate field, 153           | external to DRAM, 412                               | finding, 467                                         |
| brief history, OL2.22-5               | memory, 394–395, 412                                | flexible placement, 416–418                          |
| condition field, 334                  | network, 549                                        | least recently used (LRU), 423                       |
| unique, D-36–37                       | Barrier synchronization, B-18                       | locating in cache, 421–422                           |
| ARMv7, 62                             | defined, B-20                                       | miss rate and, 405                                   |
| ARMv8, 62, 163–169                    | for thread communication, B-34                      | multiword, mapping addresses to, 404                 |
| common features between MIPS and,     | Base addressing, 69, 120                            | placement locations, 466                             |
| 152                                   | Base registers, 70                                  | placement strategies, 418                            |
| ARPANET, OL1.12-10                    | Basic block, 96                                     | replacement selection, 423                           |

| replacement strategies, 468                     | Bytes                                          | block replacement on, 468                     |
|-------------------------------------------------|------------------------------------------------|-----------------------------------------------|
| spatial locality exploitation, 405              | addressing, 70                                 | capacity, 470, 471                            |
| state, A-592                                    | order, 70                                      | compulsory, 470                               |
| valid data, 400                                 |                                                | conflict, 470                                 |
| Bonding, 28                                     | C                                              | defined, 406                                  |
| Boolean algebra, A-594                          |                                                | direct-mapped cache, 418                      |
| Bounds check shortcut, 98                       | C.mmp, OL6.15-4                                | fully associative cache, 420                  |
| Branch address, 168                             | C language                                     | handling, 406–407                             |
| Branch datapath                                 | assignment, compiling into LEGv8, 66           | memory-stall clock cycles, 413                |
| ALU, 266                                        | compiling, 150, OL2.15-2-2.15-3                | reducing with flexible block placement,       |
| operations, 266                                 | compiling assignment with registers,           | 416-418                                       |
| Branch delay slots                              | 68                                             | set-associative cache, 419                    |
| Branch instructions                             | compiling while loops in, 95-96                | steps, 407                                    |
| pipeline impact, 329                            | sort algorithms, 146                           | in write-through cache, 407                   |
| Branch not taken                                | translation hierarchy, 128                     | Cache performance, 412–431                    |
| assumption, 328–329                             | translation to LEGv8 assembly                  | calculating, 414                              |
| defined, 266                                    | language, 66                                   | hit time and, 415–416                         |
| Branch prediction                               | variables, 106                                 | impact on processor performance, 414          |
| as control hazard solution, 295                 | C++ language, OL2.15-27, OL2.22-8              | Cache-aware instructions, 496                 |
| buffers, 331, 333                               | Cache blocking and matrix multiply,            | Caches, 397–412. See also Blocks              |
| defined, 294                                    | 489–490                                        | accessing, 400–403                            |
| dynamic, 295, 331–334                           | Cache coherence, 477–481                       | in ARM Cortex-A53, 484                        |
| static, 345                                     | coherence, 477                                 | associativity in, 419–420                     |
| Branch predictors                               | consistency, 477                               | bits in, 404                                  |
| accuracy, 333                                   | enforcement schemes, 479                       | bits needed for, 404                          |
| correlation, 333                                | implementation techniques, OL5.12-             | contents illustration, 401                    |
| information from, 333                           | 5-5.12-12                                      | defined, 21, 397–398                          |
| tournament, 334                                 | migration, 479                                 | direct-mapped, 398, 399, 404, 416             |
| Branch register, 168                            | problem, 477, 478, 481                         | empty, 400–401                                |
| Branch table, 169                               | protocol example, OL5.12-12-5.12-16            | FSM for controlling, 472                      |
| Branch taken                                    | protocols, 479                                 | fully associative, 417                        |
| cost reduction, 330                             | replication, 479                               | GPU, B-38                                     |
| defined, 266                                    | snooping protocol, 479–481                     | inconsistent, 407                             |
| Branch target                                   | snoopy, OL5.12-16-5.12-17                      | index, 402                                    |
| addresses, 266                                  | state diagram, OL5.12-16                       | in Intel Core i7, 484                         |
| buffers, 333                                    | Cache coherency protocol, OL5.12-              | Intrinsity FastMATH example,                  |
| Branches. See also Conditional branches         | 12-5.12-16                                     | 409–412                                       |
| addressing in, 117–120                          | finite-state transition diagram, OL5.12-       | locating blocks in, 421–422                   |
| compiler creation, 94                           | 15                                             | locations, 399                                |
| decision, moving up, 330                        | functioning, OL5.12-14                         | multilevel, 412, 424                          |
| delayed, 295, 330–331                           | mechanism, OL5.12-14                           | nonblocking, 483                              |
| ending, 96                                      | state diagram, OL5.12-14                       | physically addressed, 458, 459                |
| execution in ID stage, 330                      | states, OL5.12-13                              | physically indexed, 458                       |
| pipelined, 330                                  | write-back cache, OL5.12-15                    | physically tagged, 458                        |
| target address, 330                             | Cache controllers, 482                         | primary, 424, 431                             |
| unconditional, 318                              | coherent cache implementation                  | secondary, 424, 431                           |
| Branch-on-zero instruction, 280                 | techniques, OL5.12-5–5.12-12                   | set-associative, 417                          |
| B-type instruction format, 113                  | implementing, OL5.12-2                         | simulating, 491                               |
| Bubble Sort, 145                                | snoopy cache coherence, OL5.12-                | size, 403                                     |
| Bubbles, 326                                    | 16–5.12-17                                     | split, 411                                    |
| Bus-based coherent multiprocessors,             | SystemVerilog, OL5.12-2                        | summary, 411–412                              |
| OL6.15-7                                        | Cache hits, 458                                | tag field, 402                                |
| Buses, A-607                                    | Cache misses                                   | tags, OL5.12-3, OL5.12-11                     |

| Caches (Continued)                       | memory-stall, 413                       | Commercial computer development,     |
|------------------------------------------|-----------------------------------------|--------------------------------------|
| virtual memory and TLB integration,      | number of registers and, 67             | OL1.12-4-1.12-10                     |
| 457–459                                  | worst-case delay and, 283               | Commit units                         |
| virtually addressed, 458                 | Clock cycles per instruction (CPI), 35, | buffer, 350                          |
| virtually indexed, 458                   | 293                                     | defined, 350                         |
| virtually tagged, 458                    | one level of caching, 424               | in update control, 355               |
| write-back, 408, 409, 469                | two levels of caching, 424              | Common case fast, 11                 |
| write-through, 407, 409, 469             | Clock rate                              | Common subexpression elimination,    |
| writes, 407–409                          | defined, 33                             | OL2.15-6                             |
| Callee, 101, 103                         | frequency switched as function of, 41   | Communication, 23-24                 |
| Caller, 101                              | power and, 40                           | overhead, reducing, 44-45            |
| Capabilities, OL5.17-8                   | Clocking methodology, 261–263, A-636    | thread, B-34                         |
| Capacity misses, 470                     | edge-triggered, 261, A-636, A-661       | Compact code, OL2.22-4               |
| Carry lookahead, A-626-635               | level-sensitive, A-662, A-663-664       | Compare and branch zero, 330         |
| 4-bit ALUs using, A-633                  | for predictability, 261                 | Comparisons                          |
| adder, A-627                             | Clocks, A-636-638                       | constant operands in, 73             |
| fast, with first level of abstraction,   | edge, A-636, A-638                      | signed versus unsigned, 97           |
| A-627-628                                | in edge-triggered design, A-661         | Compilers, 129                       |
| fast, with "infinite" hardware,          | skew, A-662                             | branch creation, 95                  |
| A-626-627                                | specification, A-645                    | brief history, OL2.22-8-2.22-9       |
| fast, with second level of abstraction,  | synchronous system, A-636–637           | conservative, OL2.15-7               |
| A-628-634                                | Cloud computing, 549                    | defined, 14                          |
| plumbing analogy, A-630, A-631           | defined, 7                              | front end, OL2.15-3                  |
| ripple carry speed versus, A-634         | Cluster networking, 553-554, OL6.9-12   | function, 14, 129                    |
| summary, A-634–635                       | Clusters, OL6.15-8-6.15-9               | high-level optimizations,            |
| Carry save adders, 197                   | defined, 516, 546, OL6.15-8             | OL2.15-4                             |
| Cause register                           | isolation, 547                          | ILP exploitation, OL4.16-5           |
| CDC 6600, OL1.12-7, OL4.16-3             | organization, 515                       | Just In Time (JIT), 137              |
| Cell phones, 7                           | scientific computing on, OL6.15-8       | optimization, 146, OL2.22-9          |
| Central processor unit (CPU). See also   | Cm*, OL6.15-4                           | speculation, 344–345                 |
| Processors                               | CMOS (complementary metal oxide         | structure, OL2.15-2                  |
| classic performance equation,            | semiconductor), 41                      | Compiling                            |
| 36-40                                    | Coarse-grained multithreading, 530      | C assignment statements, 66          |
| defined, 19                              | Cobol, OL2.22-7                         | C language, 95, 150, OL2.15-2-2.15-3 |
| execution time, 32, 33-34                | Code generation, OL2.15-13              | floating-point programs, 222-225     |
| performance, 33–35                       | Code motion, OL2.15-7                   | if-then-else, 94                     |
| system, time, 32                         | Cold-start miss, 470                    | in Java, OL2.15-19                   |
| time, 413                                | Collision misses, 470                   | procedures, 102, 104-105             |
| time measurements, 33-34                 | Column major order, 427                 | recursive procedures, 104-105        |
| user, time, 32                           | Combinational blocks, A-592             | while loops, 95-96                   |
| Cg pixel shader program, B-15-17         | Combinational control units, C-4-8      | Compressed sparse row (CSR) matrix,  |
| Characters                               | Combinational elements, 260             | B-55, B-56                           |
| ASCII representation, 110                | Combinational logic, 261, A-591,        | Compulsory misses, 470, 471          |
| in Java, 113                             | A-597-608                               | Computer architects, 11–12           |
| Chips, 19, 25, 26                        | arrays, A-606–607                       | abstraction to simplify design, 11   |
| manufacturing process, 26                | decoders, A-597                         | common case fast, 11                 |
| Classes                                  | defined, A-593                          | dependability via redundancy, 12     |
| defined, OL2.15-15                       | don't cares, A-605–606                  | hierarchy of memories, 12            |
| packages, OL2.15-21                      | multiplexors, A-598                     | Moore's law, 11                      |
| Clear exclusive instruction (CLREX), 488 | ROMs, A-602–604                         | parallelism, 12                      |
| Clock cycles                             | two-level, A-599-602                    | pipelining, 12                       |
| defined, 33                              | Verilog, A-611–14                       | prediction, 12                       |

| Computers                                 | Control functions                       | Coprocessors                              |
|-------------------------------------------|-----------------------------------------|-------------------------------------------|
| application classes, traditional, 5-6     | ALU, mapping to gates, C-4-7            | defined, 226                              |
| applications, 4                           | defining, 276                           | Core LEGv8 instruction set, 248. See also |
| arithmetic for, 186–248                   | PLA, implementation, C-7,               | MIPS                                      |
| characteristics, OL1.12-12                | C-20-21                                 | abstract view, 258                        |
| commercial development, OL1.12-           | ROM, encoding, C-18-19                  | desktop RISC, D-9-11                      |
| 4-1.12-10                                 | for single-cycle implementation, 281    | implementation, 256-260                   |
| component organization, 17                | Control hazards, 292-295, 328-329       | implementation illustration, 259          |
| components, 17, 177                       | branch delay reduction, 330             | overview, 257-260                         |
| design measure, 53                        | branch not taken assumption, 328        | subset, 256                               |
| desktop, 5                                | branch prediction as solution, 295      | Cores                                     |
| embedded, 5                               | delayed decision approach, 295          | defined, 43                               |
| first, OL1.12-2-1.12-4                    | dynamic branch prediction, 331          | number per chip, 43                       |
| in information revolution, 4              | logic implementation in Verilog,        | Correlation predictor, 333                |
| instruction representation, 82-89         | OL4.13-8                                | Cosmic Cube, OL6.15-7                     |
| performance measurement,                  | pipeline stalls as solution, 293        | CPU, 9                                    |
| OL1.12-10                                 | pipeline summary, 335–336               | Cray computers, OL3.12-5-3.12-6           |
| post-PC era, 6–7                          | simplicity, 328                         | Critical word first, 406                  |
| principles, 86                            | solutions, 293                          | Crossbar networks, 551                    |
| servers, 5                                | static multiple-issue processors and,   | CTSS (Compatible Time-Sharing             |
| Condition codes/flags, 97                 | 345-346                                 | System), OL5.18-9                         |
| Conditional branches                      | Control lines                           | CUDA programming environment, 539,        |
| changing program counter with, 333        | asserted, 276                           | B-5                                       |
| compiling if-then-else into, 94           | in datapath, 275                        | barrier synchronization, B-18, B-34       |
| defined, 93                               | execution/address calculation, 312      | development, B-17, B-18                   |
| desktop RISC, D-16                        | final three stages, 314                 | hierarchy of thread groups, B-18          |
| embedded RISC, D-16                       | instruction decode/register file read,  | kernels, B-19, B-24                       |
| implementation, 99                        | 312                                     | key abstractions, B-18                    |
| in loops, 119                             | instruction fetch, 312                  | paradigm, B-19–23                         |
| PA-RISC, D-34, D-35                       | memory access, 312                      | parallel plus-scan template, B-61         |
| PC-relative addressing, 118               | setting of, 276                         | per-block shared memory, B-58             |
| RISC, D-10–16                             | values, 312                             | plus-reduction implementation, B-63       |
| SPARC, D-10–12                            | write-back, 312                         | programs, B-6, B-24                       |
| Conditional move instructions, 334        | Control signals                         | scalable parallel programming with,       |
| Conflict misses, 470                      | ALUOp, 275                              | B-17-23                                   |
| Constant memory, B-40                     | defined, 262                            | shared memories, B-18                     |
| Constant operands, 73–74                  | effect of, 276                          | threads, B-36                             |
| frequent occurrence, 73                   | multi-bit, 276                          | Cyclic redundancy check, 437              |
| Content Addressable Memory (CAM), 422     | pipelined datapaths with, 311–315       | Cylinder, 396                             |
| Context switch, 460                       | truth tables, C-14                      |                                           |
| Control                                   | Control units, 259. See also Arithmetic | D                                         |
| ALU, 271–273                              | logic unit (ALU)                        |                                           |
| challenge, 336                            | address select logic, C-24, C-25        | D flip-flops, A-639, A-641                |
| finishing, 281                            | combinational, implementing, C-4–8      | D latches, A-639, A-640                   |
| forwarding, 320                           | with explicit counter, C-23             | Data bits, 435                            |
| FSM, C-8–21                               | illustrated, 277                        | Data flow analysis, OL2.15-11             |
| implementation, optimizing, C-27–28       | logic equations, C-11                   | Data hazards, 289–292, 316–328.           |
| mapping to hardware, C-2–32               | main, designing, 273–276                | See also Hazards                          |
| memory, C-26                              | as microcode, C-28                      | forwarding, 289, 316–328                  |
| organizing, to reduce logic, C-31–32      | MIPS, C-10                              | load-use, 290, 329                        |
| pipelined, 311–315                        | next-state outputs, C-10, C-12–13       | stalls and, 324–328                       |
| Control flow graphs, OL2.15-9–2.15-10     | output, 271–273, C-10                   | Data parallel problem decomposition,      |
| illustrated examples, OL2.15-9, OL2.15-10 | Cooperative thread arrays (CTAs), B-30  | B-17, B-18                                |

| Data race, 125                            | Denormalized numbers, 230             | Direct memory access (DMA), OL6.9-4      |
|-------------------------------------------|---------------------------------------|------------------------------------------|
| Data selectors, 258                       | Dependability via redundancy, 12      | Direct3D, B-13                           |
| Data transfer instructions.               | Dependable memory hierarchy, 432–437  | Direct-mapped caches. See also Caches    |
| See also Instructions                     | failure, defining, 432                | address portions, 421                    |
| defined, 68, 69                           | Dependences                           | choice of, 422                           |
| load, 69                                  | between pipeline registers, 319       | defined, 398, 416                        |
| offset, 70                                | between pipeline registers and ALU    | illustrated, 399                         |
| store, 71–72                              | inputs, 319                           | memory block location, 417               |
| Datacenters, 7                            | bubble insertion and, 326             | misses, 419                              |
| Data-level parallelism, 524               | detection, 318                        | single comparator, 421                   |
| Datapath elements                         | name, 348                             | total number of bits, 404                |
| defined, 263                              | sequence, 316                         | Dirty bit, 452                           |
| sharing, 268                              | Design                                | Dirty pages,                             |
| Datapaths                                 | compromises and, 85                   | Disk memory, 395–397                     |
| branch, 266                               | datapath, 263                         | Displacement addressing, 120             |
| building, 263-271                         | digital, 366                          | Distributed Block-Interleaved Parity     |
| control signal truth tables, C-14         | logic, 260–263, B-1–79                | (RAID 5), OL5.11-6                       |
| control unit, 277                         | main control unit, 273–276            | Divide algorithm, 200                    |
| defined, 19                               | memory hierarchy, challenges, 472     | Dividend, 198                            |
| design, 263                               | pipelining instruction sets, 288      | Division, 197–203                        |
| exception handling, 339                   | Desktop and server RISCs. See also    | algorithm, 199                           |
| for fetching instructions, 265            | Reduced instruction set computer      | dividend, 198                            |
| for hazard resolution via forwarding, 323 | (RISC) architectures                  | divisor, 198                             |
| for LEGv8 architecture, 269               | addressing modes, D-6                 | Divisor, 198                             |
| for memory instructions, 267              | architecture summary, D-4             | divu (Divide Unsigned). See also         |
| in operation for branch-on-zero           | arithmetic/logical instructions, D-11 | Arithmetic                               |
| instruction, 280                          | conditional branches, D-16            | faster, 202–203                          |
| in operation for load instruction, 279    | constant extension summary, D-9       | floating-point, 220                      |
| in operation for R-type instruction,      | control instructions, D-11            | hardware, 198–201                        |
| 277, 278                                  | conventions equivalent to MIPS core,  | hardware, improved version, 201          |
| operation of, 276–280                     | D-12                                  | in LEGv8, 203                            |
| pipelined, 297–315                        | data transfer instructions, D-10      | operands, 198                            |
| for R-type instructions, 278,             | features added to, D-45               | quotient, 198                            |
| 276–277                                   | floating-point instructions, D-12     | remainder, 198                           |
| single, creating, 267                     | instruction formats, D-7              | signed, 201–202                          |
| single-cycle, 296                         | multimedia extensions, D-16–18        | SRT, 203                                 |
| static two-issue, 347                     | multimedia support, D-18              | Don't cares, A-605–606                   |
| Deasserted signals, 262, A-592            | types of, D-3                         | example, A-605–606                       |
| DEC PDP-8, OL2.22-3                       | Desktop computers, defined, 5         | term, 273                                |
| Decimal numbers                           | Device driver, OL6.9-5                | Double data rate (DDR), 393              |
| binary number conversion to, 78           | DGEMM (Double precision General       | Double Data Rate (DDR) SDRAM,            |
| defined, 75                               | Matrix Multiply), 238, 363, 365, 427, | 393–394, A-653                           |
| Decision-making instructions, 93–99       | 553                                   | Double precision. See also Single        |
| Decoders, A-597                           | cache blocked version of, 429         | precision                                |
| two-level, A-653                          | optimized C version of, 241, 363, 490 | defined, 207                             |
| Decoding machine language, 121–125        | performance, 365, 430                 | FMA, B-45–46                             |
| Defect, 26                                | Dicing, 27                            | GPU, B-45–46, B-74                       |
| Delayed branches, 295. See also Branches  | Dies, 26, 26–27                       | representation, 206–207                  |
| as control hazard solution, 295           | Digital design pipeline, 366          | Doubleword, 66, 158                      |
| embedded RISCs and, D-23                  | Digital signal-processing (DSP)       | Dual inline memory modules (DIMMs), 395  |
| for five-stage pipelines, 323–324         | extensions, D-19                      | Dynamic branch prediction, 331–334.      |
| reducing, 330                             | DIMMs (dual inline memory modules),   | See also Control hazards                 |
| Delayed decision, 295                     | OL5.17-5                              | branch prediction buffer, 331            |
| DeMorgan's theorems, A-599                | Direct Data IO (DDIO), OL6.9-6        | loops and, 333                           |

| Dynamic hardware predictors, 295                | Elements                                           | Exception program counters (EPCs), 326    |
|-------------------------------------------------|----------------------------------------------------|-------------------------------------------|
| Dynamic multiple-issue processors, 343,         | combinational, 260                                 | address capture, 331                      |
| 349–352. See also Multiple issue                | datapath, 263, 268                                 | copying, 181                              |
| pipeline scheduling, 350–352                    | memory, A-638-646                                  | defined, 181, 327                         |
| superscalar, 349                                | state, 260, 262, 264, A-636, A-638                 | in restart determination, 326-327         |
| Dynamic pipeline scheduling, 350–352            | Embedded computers, 5                              | transferring, 182                         |
| commit unit, 350                                | application requirements, 6                        | Exception Syndrome Register (ESR), 337,   |
| concept, 350                                    | design, 5                                          | 461                                       |
| hardware-based speculation, 352                 | growth, OL1.12-12-1.12-13                          | Exceptions, 336-342                       |
| primary units, 351                              | Embedded Microprocessor Benchmark                  | association, 342                          |
| reorder buffer, 355                             | Consortium (EEMBC), OL1.12-12                      | datapath with controls for handling,      |
| reservation station, 350                        | Embedded RISCs. See also Reduced                   | 339                                       |
| Dynamic random access memory                    | instruction set computer (RISC)                    | defined, 207, 336                         |
| (DRAM), 392, 393–395, A-651–653                 | architectures                                      | detecting, 336                            |
| bandwidth external to, 412                      | addressing modes, D-6                              | event types and, 336                      |
| cost, 23                                        | architecture summary, D-4                          | imprecise, 342                            |
| defined, 19, A-651                              | arithmetic/logical instructions, D-14              | interrupts versus, 336                    |
| DIMM, OL5.17-5                                  | conditional branches, D-16                         | in LEGv8 architecture, 337–338            |
| Double Data Rate (DDR), 393-394                 | constant extension summary, D-9                    | overflow, 339                             |
| early board, OL5.17-4                           | control instructions, D-15                         | pipelined computer example, 339           |
| GPU, B-37–38                                    | data transfer instructions, D-13                   | in pipelined implementation, 338–342      |
| growth of capacity, 25                          | delayed branch and, D-23                           | precise, 342                              |
| history, OL5.17-2                               | DSP extensions, D-19                               | reasons for, 337–338                      |
| internal organization of, 394                   | general purpose registers, D-5                     | result due to overflow in add             |
| pass transistor, A-651                          | instruction conventions, D-15                      | instruction, 341                          |
| SIMM, OL5.17-5, OL5.17-6                        | instruction formats, D-8                           | saving/restoring stage on, 462            |
| single-transistor, A-652                        | multiply-accumulate approaches, D-19               | Executable files                          |
| size, 412                                       | types of, D-4                                      | defined, 131                              |
| speed, 23                                       | Encoding                                           | Execute or address calculation stage, 303 |
| synchronous (SDRAM), 393–394,                   | defined, C-31                                      | Execute/address calculation               |
| A-648, A-653                                    | LEGv8 instruction, 86, 122                         | control line, 312                         |
| two-level decoder, A-653                        | ROM control function, C-18–19                      | load instruction, 303                     |
|                                                 |                                                    |                                           |
| Dynamically linked libraries (DLLs),            | ROM logic function, A-603                          | store instruction, 303                    |
| 134–136                                         | x86 instruction, 161–162                           | Execution time                            |
| defined, 134                                    | ENIAC (Electronic Numerical Integrator             | as valid performance measure, 51          |
| lazy procedure linkage version, 135             | and Calculator), OL1.12-2, OL1.12-3,               | CPU, 32, 33–34                            |
|                                                 | OL5.17-2                                           | pipelining and, 297                       |
| E                                               | EPIC, OL4.16-5                                     | Explicit counters, C-23, C-26             |
|                                                 | Error correction, A-653–655                        | Exponents, 206                            |
| Early restart, 406                              | Error Detecting and Correcting Code                | Extended-register instructions, 164       |
| Edge-triggered clocking methodology,            | (RAID 2), OL5.11-5                                 |                                           |
| 261, 262, A-636, A-661                          | Error detection, A-654                             | F                                         |
| advantage, A-637                                | Error detection code, 434                          |                                           |
| clocks, A-661                                   | Ethernet, 23                                       | Failures, synchronizer, A-665             |
| drawbacks, A-662                                | EX stage                                           | Fallacies. See also Pitfalls              |
| illustrated, A-638                              | load instructions, 303                             | add immediate unsigned, 227               |
| rising edge/falling edge, A-636                 | overflow exception detection, 338, 341             | Amdahl's law, 572                         |
| EDSAC (Electronic Delay Storage                 | store instructions, 305                            | arithmetic, 242–245                       |
| Automatic Calculator), OL1.12-3,                | Exabyte, 6                                         | assembly language for performance,        |
| OL5.17-2                                        | Exception enable, 461                              | 169                                       |
| Eispack, OL3.12-4                               | Exception link register (ELR), 337, 459, 461       | commercial binary compatibility           |
| Electrically erasable programmable              | address capture, 340                               | importance, 170                           |
| read-only memory (EEPROM),                      | defined, 338                                       | defined, 49                               |
| 395                                             | in restart determination, 337                      | GPUs, B-72-74, B-75                       |

| Fallacies (Continued)                                                | Flip-flops                                | Floating-point instructions                    |
|----------------------------------------------------------------------|-------------------------------------------|------------------------------------------------|
| low utilization uses little power, 50                                | D flip-flops, A-639, A-641                | desktop RISC, D-12                             |
| peak performance, 572                                                | defined, A-639                            | SPARC, D-31                                    |
| pipelining, 366                                                      | Floating point, 205-230, 232              | Floating-point multiplication, 215-219         |
| powerful instructions mean higher                                    | assembly language, 221                    | binary, 219                                    |
| performance, 169                                                     | backward step, OL3.12-4-3.12-5            | illustrated, 218                               |
| right shift, 242                                                     | binary to decimal conversion, 211         | instructions, 220                              |
| False sharing, 480                                                   | branch, 220                               | significands, 215                              |
| Fast carry                                                           | challenges, 246                           | steps, 215, 217                                |
| with "infinite" hardware, A-626-627                                  | diversity versus portability,             | Flow-sensitive information,                    |
| with first level of abstraction,                                     | OL3.12-3-3.12-4                           | OL2.15-15                                      |
| A-627–628                                                            | division, 220                             | Flushing instructions, 329, 330                |
| with second level of abstraction,                                    | first dispute, OL3.12-2–3.12-3            | exceptions and, 340                            |
| A-628–634                                                            | form, 206                                 | For loops, 147, OL2.15-26                      |
| Fast Fourier Transforms (FFT), B-53                                  | fused multiply add, 228                   | inner, OL2.15-24                               |
| Fault avoidance, 433                                                 | guard digits, 226–227                     | SIMD and, OL6.15-2                             |
|                                                                      | history, OL3.12-3                         | Format fields, C-31                            |
| Fault forecasting, 433                                               | IEEE 754 standard, 207–211                |                                                |
| Fault tolerance, 433                                                 |                                           | Fortran, OL2.22-7                              |
| Fermi architecture, 539, 568                                         | intermediate calculations, 226            | Forwarding, 316–328                            |
| Field programmable devices (FPDs),                                   | LEGv8 instruction frequency for, 248      | ALU before, 321                                |
| A-666                                                                | LEGv8 instructions, 220–226               | control, 320                                   |
| Field programmable gate arrays (FPGAs),                              | machine language, 221                     | datapath for hazard resolution, 323            |
| A-666                                                                | operands, 221                             | defined, 289                                   |
| Fields                                                               | overflow, 206                             | functioning, 317                               |
| defined, 84                                                          | packed format, 232                        | graphical representation, 290                  |
| format, C-31                                                         | precision, 243                            | illustrations, OL4.13-26                       |
| LEGv8, 84–86                                                         | procedure with two-dimensional            | multiple results and, 292                      |
| names, 84                                                            | matrices, 223–225                         | multiplexors, 322                              |
| Files, register, 264, 269, A-638,                                    | programs, compiling, 222–225              | pipeline registers before, 321                 |
| A-642-644                                                            | registers, 226                            | with two instructions, 289–290                 |
| Fine-grained multithreading, 530                                     | representation, 206–211                   | Verilog implementation,                        |
| Finite-state machines (FSMs), 472–477,                               | rounding, 226                             | OL4.13-2-4.13-4                                |
| A-655–660                                                            | sign and magnitude, 206                   | Fractions, 206, 207                            |
| control, C-8–22                                                      | SSE2 architecture, 232, 233               | Frame buffer, 18                               |
| controllers, 475                                                     | subtraction, 220                          | Frame pointers, 106                            |
| for multicycle control, C-9                                          | underflow, 206                            | Front end, OL2.15-3                            |
| for simple cache controller, 476-477                                 | units, 227                                | Fully associative caches. See also Caches      |
| implementation, 474, A-658                                           | in x86, 233                               | block replacement strategies,                  |
| Mealy, 475                                                           | Floating vectors, OL3.12-3                | 468-469                                        |
| Moore, 475                                                           | Floating-point addition, 212–215          | choice of, 422                                 |
| next-state function, 474, A-655                                      | arithmetic unit block diagram, 216        | defined, 417                                   |
| output function, A-655, A-657                                        | binary, 213                               | memory block location, 417                     |
| state assignment, A-658                                              | illustrated, 214                          | misses, 420                                    |
| state register implementation, A-659                                 | instructions, 220                         | Fully connected networks, 551                  |
| style of, 475                                                        | steps, 212                                | Fused-multiply-add (FMA) operation,            |
| synchronous, A-655                                                   | Floating-point arithmetic (GPUs), B-41-46 | 228, B-45-46                                   |
| SystemVerilog, OL5.12-7                                              | basic, B-42                               |                                                |
| traffic light example, A-656–658                                     | double precision, B-45-46, B-74           | G                                              |
| Flash memory, 395                                                    | performance, B-44                         |                                                |
| characteristics, 23                                                  | specialized, B-42-44                      | Galois/Counter Mode (GCM)                      |
| defined, 23                                                          | supported formats, B-42                   | encryption, 488                                |
| Flat address space, 493                                              | texture operations, B-44                  | Game consoles, B-9                             |

| Gates, A-591, A-596                    | General Purpose (GPGPUs), B-5                    | Hardware multithreading, 530–533                     |
|----------------------------------------|--------------------------------------------------|------------------------------------------------------|
| AND, A-600, C-7                        | graphics mode, B-6                               | coarse-grained, 530                                  |
| delays, A-634–635                      | graphics trends, B-4                             | options, 531                                         |
| mapping ALU control function to,       | history, B-3–4                                   | simultaneous, 531                                    |
| C-4-7                                  | logical graphics pipeline, B-13–14               | Hardware-based speculation, 352                      |
| NAND, A-596                            | mapping applications to, B-55–72                 | Harvard architecture, OL1.12-4                       |
| NOR, A-596, A-638                      | memory, 538                                      | Hazard detection units, 324                          |
| Gather-scatter, 527, 568               | multilevel caches and, 538                       | functions, 324                                       |
| General Purpose GPUs (GPGPUs),         | N-body applications, B-65–72                     | pipeline connections for, 327                        |
| B-5                                    | NVIDIA architecture, 539–541                     | Hazards. See also Pipelining                         |
| General-purpose registers, 154         | parallel memory system, B-36–41                  | control, 292–293, 328–336                            |
| architectures, OL2.22-3                | parallelism, 539, B-76                           | data, 289, 316–328                                   |
| embedded RISCs, D-5                    | performance doubling, B-4                        | forwarding and, 323                                  |
| Generate                               | perspective, 543–545                             | structural, 288–289, 305                             |
| defined, A-628                         | programming, B-12–24                             | Heap                                                 |
| example, A-632                         | programming interfaces to, B-17                  | allocating space on, 107–110                         |
| super, A-629                           | real-time graphics, B-13                         | defined, 107                                         |
| Gigabyte, 6                            | summary, B-76                                    | Heterogeneous systems, B-4–5                         |
| Global common subexpression            | Graphics shader programs, B-14–15                | architecture, B-7–9                                  |
| elimination, OL2.15-6                  | Gresham's Law, 248, OL3.12-2                     | defined, B-3                                         |
| Global memory, B-21, B-39              | Grid computing, 549                              | Hexadecimal numbers, 83                              |
| Global miss rates, 430                 | Grids, B-19                                      | binary number conversion to,                         |
| Global optimization, OL2.15-5          | GTX 280, 564–569                                 | 83, 84                                               |
| code, OL2.15-7                         | Guard digits                                     | Hierarchy of memories, 12                            |
| implementing, OL2.15-8-2.15-11         | defined, 226                                     | High-level languages, 14–16                          |
| Global pointers, 106                   | rounding with, 227                               | benefits, 16                                         |
| GPU computing. See also Graphics       |                                                  | computer architectures, OL2.22-5                     |
| processing units (GPUs)                | H                                                | importance, 16                                       |
| defined, B-5                           |                                                  | High-level optimizations,                            |
| visual applications, B-6–7             | Half precision, B-42                             | OL2.15-4-2.15-5                                      |
| GPU system architectures, B-7–12       | Halfwords, 114                                   | Hit rate, 390                                        |
| graphics logical pipeline, B-10        | Hamming, Richard, 434                            | Hit time                                             |
| heterogeneous, B-7-9                   | Hamming distance, 434                            | cache performance and, 415–416                       |
| implications for, B-24                 | Hamming Error Correction Code (ECC),             | defined, 390                                         |
| interfaces and drivers, B-9            | 434–435                                          | Hit under miss, 483                                  |
| unified, B-10–12                       | calculating, 434-435                             | Hold time, A-642                                     |
| Graph coloring, OL2.15-12              | Hard disks                                       | Horizontal microcode, C-32                           |
| Graphics displays                      | access times, 23                                 | Hot-swapping, OL5.11-7                               |
| computer hardware support, 18          | defined, 23                                      | Human genome project, 4                              |
| LCD, 18                                | Hardware                                         |                                                      |
| Graphics logical pipeline, B-10        | as hierarchical layer, 13                        | I                                                    |
| Graphics processing units (GPUs),      | language of, 14–16                               |                                                      |
| 538-543. See also GPU computing        | operations, 63–67                                | I/O, OL6.9-2, OL6.9-3                                |
| as accelerators, 538                   | supporting procedures in, 100-110                | on system performance, OL5.11-2                      |
| attribute interpolation, B-43-44       | synthesis, A-609                                 | I/O benchmarks. See Benchmarks                       |
| defined, 46, 522, B-3                  | translating microprograms to, C-28-32            | IBM 360/85, OL5.17-7                                 |
| evolution, B-5                         | virtualizable, 440                               | IBM 701, OL1.12-5                                    |
| fallacies and pitfalls, B-72-75        | Hardware description languages. See also         | IBM 7030, OL4.16-2                                   |
| floating-point arithmetic, B-17,       | Verilog                                          | IBM ALOG, OL3.12-7                                   |
| B-41-46, B-74                          | defined, A-608                                   | IBM Blue Gene, OL6.15-9-6.15-10                      |
| GeForce 8-series generation, B-5       | using, A-608–614                                 | IBM Personal Computer, OL1.12-7,                     |
| general computation, B-73-74           | VHDL, A-608–609                                  | OL2.22-6                                             |

| IBM System/360 computers, OL1.12-6,      | defined, 83                               | fetching, 265                           |
|------------------------------------------|-------------------------------------------|-----------------------------------------|
| OL3.12-6, OL4.16-2                       | desktop/server RISC architectures, D-7    | fields, 83                              |
| IBM z/VM, OL5.17-8                       | embedded RISC architectures, D-8          | floating-point (x86), 232, 233          |
| ID stage                                 | I-type, 85                                | floating-point, 220–221                 |
| branch execution in, 330, 331            | LEGv8, 151                                | flushing, 329, 330, 340                 |
| load instructions, 303                   | MIPS, 151                                 | immediate, 73                           |
| store instruction in, 302                | R-type, 85, 273                           | introduction to, 62–63                  |
| IEEE 754 floating-point standard,        | x86, 161                                  | jump                                    |
| 207-211, 208, OL3.12-8-3.12-10. See also | Instruction latency, 367                  | left-to-right flow, 298                 |
| Floating point                           | Instruction mix, 39, OL1.12-10            | load, 69                                |
| first chips, OL3.12-8-3.12-9             | Instruction set architecture              | logical operations, 90-93               |
| in GPU arithmetic, B-42-43               | ARM, 152–154                              | M32R, D-40                              |
| implementation, OL3.12-10                | branch address calculation, 266           | memory access, B-33-34                  |
| rounding modes, 227                      | defined, 22, 52                           | memory-reference, 257                   |
| today, OL3.12-10                         | history, 173–174                          | multiplication, 197                     |
| If statements, 118                       | maintaining, 52                           | nop, 325–326                            |
| I-format, 87                             | protection and, 441                       | PA-RISC, D-34–36                        |
| If-then-else, 94                         | thread, B-31-34                           | performance, 35-36                      |
| Immediate addressing, 120                | virtual machine support, 440-441          | pipeline sequence, 325                  |
| Immediate instructions, 73               | Instruction sets, B-49                    | PowerPC, D-12-13, D-32-34               |
| Imprecise interrupts, 342, OL4.16-4      | ARMv8, 171                                | PTX, B-31, B-32                         |
| Index-out-of-bounds check, 98            | design for pipelining, 228                | representation in computer, 82-89       |
| Induction variable elimination, OL2.15-7 | LEGv8, 247                                | restartable, 462                        |
| Inheritance, OL2.15-15                   | MIPS-32, 151                              | resuming,                               |
| In-order commit, 351                     | x86 growth, 170                           | R-type, 263, 268                        |
| Input devices, 16                        | Instruction-level parallelism (ILP), 365. | SPARC, D-29-32                          |
| Inputs, 273                              | See also Parallelism                      | store, 72                               |
| Instances, OL2.15-15                     | compiler exploitation, OL4.16-5-4.16-6    | store exclusive register (STXR), 126    |
| Instruction count, 36, 38                | defined, 43, 344                          | subtraction, 190                        |
| Instruction decode/register file read    | exploitation, increasing, 354             | SuperH, D-39-40                         |
| stage                                    | and matrix multiply, 363-365              | thread, B-30-31                         |
| control line, 311–312                    | Instructions, 60–174, D-25–27, D-40–42.   | Thumb, D-38                             |
| load instruction, 300                    | See also Arithmetic instructions;         | vector, 524                             |
| store instruction, 305                   | MIPS; Operands                            | as words, 62                            |
| Instruction execution illustrations,     | add immediate, 73                         | x86, 154-159                            |
| OL4.13-16-4.13-17                        | addition, 190                             | Instructions per clock cycle (IPC), 343 |
| clock cycle 9, OL4.13-24                 | Alpha, D-27–29                            | Integrated circuits (ICs), 19. See also |
| clock cycles 1 and 2, OL4.13-21          | arithmetic-logical, 263                   | specific chips                          |
| clock cycles 3 and 4, OL4.13-22          | ARM, 152–154, D-36–37                     | cost, 27                                |
| clock cycles 5 and 6, OL4.13-23          | assembly, 66                              | defined, 25                             |
| clock cycles 7 and 8, OL4.13-24          | basic block, 96                           | manufacturing process, 26               |
| examples, OL4.13-20-4.13-25              | cache-aware, 496                          | very large-scale (VLSIs), 25            |
| forwarding, OL4.13-26-4.13-31            | conditional branch, 93, 94                | Intel Core i7, 46–49, 256, 517, 564–569 |
| no hazard, OL4.13-17                     | conditional move, 334                     | address translation for, 483            |
| pipelines with stalls and forwarding,    | core, 246                                 | architectural registers, 358            |
| OL4.13-26, OL4.13-20                     | data transfer, 68                         | caches in, 484                          |
| Instruction fetch stage                  | decision-making, 93-99                    | memory hierarchies of, 482–488          |
| control line, 312                        | defined, 14, 62                           | microarchitecture, 358                  |
| load instruction, 300                    | desktop RISC conventions, D-12            | performance of, 485–486                 |
| store instruction, 305                   | as electronic signals, 82                 | SPEC CPU benchmark, 46–48               |
| Instruction formats, 161                 | embedded RISC conventions, D-15           | SPEC power benchmark, 48–49             |
| ARMv7, 151                               | encoding, 86                              | TLB hardware for, 483                   |

| Intel Core i7 920, 358–360             | strings in, 113–115                     | architecture, 204                        |
|----------------------------------------|-----------------------------------------|------------------------------------------|
| microarchitecture, 358                 | translation hierarchy, 136              | arithmetic core, 246                     |
| Intel Core i7 960                      | while loop compilation in, OL2.15-      | arithmetic instructions, 63              |
| benchmarking and rooflines of,         | 18-2.15-19                              | arithmetic/logical instructions not in,  |
| 564–569                                | Java Virtual Machine (JVM), 150,        | D-21, D-23                               |
| Intel Core i7 Pipelines, 354, 358–360  | OL2.15-16                               | assembly instruction, mapping, 82–83     |
| memory components, 359                 | Jump instructions, 254, D-26            | common extensions to, D-20-25            |
| performance, 361–362                   | branch instruction *versus*, 270        | compiling C assignment statements        |
| program performance, 362               | control and datapath for, 271           | into, 66                                 |
| specification, 356                     | implementing, 270                       | compiling complex C assignment into,     |
| Intel IA-64 architecture, OL2.22-3     | instruction format, 270                 | 66                                       |
| Intel Paragon, OL6.15-8                | Just In Time (JIT) compilers, 137, 576  | control instructions not in, D-21        |
| Intel Threading Building Blocks, B-60  |                                         | control registers, 461                   |
| Intel x86 microprocessors              | K                                       | control unit, C-10                       |
| clock rate and power for, 40           |                                         | data transfer instructions not in, D-20, |
| Interference graphs, OL2.15-12         | Karnaugh maps, A-606                    | D-22                                     |
| Interleaving, 412                      | Kernel mode, 459                        | divide in, 203                           |
| Interprocedural analysis, OL2.15-14    | Kernels                                 | exceptions in, 337–338                   |
| Interrupt enable, 461                  | CUDA, B-19, B-24                        | fields, 84–85                            |
| Interrupt-driven I/O, OL6.9-4          | defined, B-19                           | floating-point instructions not in,      |
| Interrupts                             | Kilobyte, 6                             | D-22                                     |
| defined, 207, 336                      |                                         | floating-point instructions, 220–221     |
| event types and, 336                   | L                                       | instruction classes, 173                 |
| exceptions versus, 336                 |                                         | instruction encoding, 86, 122            |
| imprecise, 342, OL4.16-4               | LAPACK, 243                             | instruction formats, 124, 151            |
| precise, 342                           | Large-scale multiprocessors, OL6.15-7,  | instruction set, 62, 171, 246, 247,      |
| vectored, 337                          | OL6.15-9-6.15-10                        | 256–260, D-9–10                          |
| Intrinsity FastMATH processor, 409–412 | Latches                                 | machine language, 88                     |
| caches, 410                            | D latch, A-639, A-640                   | memory addresses, 71                     |
| data miss rates, 411, 421              | defined, A-639                          | memory allocation for program and        |
| read processing, 456                   | Latency                                 | data, 108                                |
| TLB, 454–457                           | instruction, 367                        | multiply in, 197                         |
| write-through processing, 456          | memory, B-74–75                         | operands, 64                             |
| Inverted page tables, 451              | pipeline, 297                           | Pseudo, 246                              |
| Issue packets, 345                     | use, 346                                | register conventions, 109                |
|                                        | LDUR (load register), 64                | static multiple issue with, 345–347      |
| J                                      | LDURB (load byte), 64                   | Level-sensitive clocking, A-662,         |
|                                        | LDURH (load half), 64                   | A-663–664                                |
| Java                                   | LDURSW (load signed word), 64           | defined, A-662                           |
| bytecode, 136                          | LDXR (load exclusive register), 64, 122 | two-phase, A-663                         |
| bytecode architecture, OL2.15-17       | Leaf procedures. See also Procedures    | Link, OL6.9-2                            |
| characters in, 113–115                 | defined, 104                            | Linkers, 131–134                         |
| compiling in, OL2.15-19-2.15-20        | example, 113                            | defined, 131                             |
| goals, 136                             | Least recently used (LRU)               | executable files, 131                    |
| interpreting, 136, 150, OL2.15-15      | as block replacement strategy, 468–469  | steps, 131                               |
| keywords, OL2.15-21                    | defined, 423                            | using, 131–134                           |
| method invocation in, OL2.15-21        | pages, 448                              | Linking object files, 132–134            |
| pointers, OL2.15-26                    | Least significant bits, A-620           | Linpack, 554, OL3.12-4                   |
| primitive types, OL2.15-26             | defined, 75                             | Liquid crystal displays (LCDs), 18       |
| programs, starting, 136–137            | SPARC, D-31                             | LISP, SPARC support, D-30                |
| reference types, OL2.15-26             | Left-to-right instruction flow, 298–299 |                                          |
| sort algorithms, 146                   | LEGv8, 62, 64, 86                       | Livermore Loops, OL1.12-11               |

| Load balancing, 521–522                  | sequential, A-593, A-644-646            | Megabyte, 6                                          |
|------------------------------------------|-----------------------------------------|------------------------------------------------------|
| Load byte, 167                           | two-level, A-599–602                    | Memory                                               |
| Load halfword, 167                       | Logical operations, 90–93               | addresses, 79                                        |
| Load instructions. See also Store        | AND, 91                                 | affinity, 562                                        |
| instructions                             | ARM, 154                                | atomic, B-21                                         |
| access, B-41                             | desktop RISC, D-11                      | bandwidth, 394–395, 411                              |
| base register, 274                       | embedded RISC, D-14                     | cache, 21, 397–412, 412–431                          |
| compiling with, 71–72                    | EOR, 92                                 | CAM, 422                                             |
| datapath in operation for, 279           | NOT, 91                                 | constant, B-40                                       |
| defined, 69                              | OR, 91                                  | control, C-26                                        |
| EX stage, 303                            | shifts, 90                              | defined, 19                                          |
| halfword unsigned, 114                   | Long instruction word (LIW), OL4.16-5   | DRAM, 19, 393–394, A-651–653                         |
| ID stage, 302                            | Lookup tables (LUTs), A-667             | flash, 23                                            |
| IF stage, 302                            | Loop unrolling                          | global, B-21, B-39                                   |
| load byte unsigned, 79                   | defined, 348, OL2.15-4                  | GPU, 538                                             |
| load half, 114                           | for multiple-issue pipelines, 348       | instructions, datapath for, 267                      |
| MEM stage, 304                           | register renaming and, 348              | local, B-21, B-40                                    |
| move wide with keep, 115                 | Loops, 95–96                            | main, 23                                             |
| move wide with zeros, 115                | conditional branches in, 118            | nonvolatile, 22                                      |
| pipelined datapath in, 307               | for, 147                                | operands, 68–69                                      |
| signed, 79                               | prediction and, 333-334                 | parallel system, B-36–41                             |
| unit for implementing, 267               | test, 147, 148                          | read-only (ROM), A-602–604                           |
| unsigned, 79                             | while, compiling, 95–96                 | SDRAM, 393–394                                       |
| WB stage, 304                            |                                         | secondary, 23                                        |
| Load register, 69, 72                    | M                                       | shared, B-21, B-39-40                                |
| Loaders, 134                             |                                         | spaces, B-39                                         |
| Load-store architectures, OL2.22-3       | M32R, D-15, D-40                        | SRAM, A-646-650                                      |
| Load-use data hazard, 290, 329           | Machine code, 83                        | stalls, 414                                          |
| Load-use stalls, 329                     | Machine instructions, 83                | technologies for building, 24-28                     |
| Local area networks (LANs), 24. See also | Machine language, 15                    | texture, B-40                                        |
| Networks                                 | branch offset in, 119                   | virtual, 441–465                                     |
| Local memory, B-21, B-40                 | decoding, 121-124                       | volatile, 22                                         |
| Local miss rates, 430                    | defined, 14, 83                         | Memory access instructions, B-33-34                  |
| Local optimization, OL2.15-5. See also   | floating-point, 221                     | Memory access stage                                  |
| Optimization                             | illustrated, 15                         | control line, 313                                    |
| implementing, OL2.15-8                   | LEGv8, 88                               | load instruction, 303                                |
| Locality                                 | SRAM, 21                                | store instruction, 303                               |
| principle, 388                           | translating MIPS assembly language      | Memory bandwidth, 565, 573                           |
| spatial, 388, 391                        | into, 86                                | Memory consistency model, 481                        |
| temporal, 388, 391                       | Main memory, 442. See also Memory       | Memory elements, A-638-646                           |
| Lock synchronization, 125                | defined, 23                             | clocked, A-639                                       |
| Locks, 534                               | page tables, 451                        | D flip-flop, A-639, A-641                            |
| Logic                                    | physical addresses, 442                 | D latch, A-640                                       |
| address select, C-24, C-25               | Mapping applications, B-55-72           | DRAMs, A-651-655                                     |
| ALU control, C-6                         | Mark computers, OL1.12-14               | flip-flop, A-639                                     |
| combinational, 262, A-593, A-597-608     | Matrix multiply, 238-242, 569-571       | hold time, A-642                                     |
| components, 261                          | Mealy machine, 475, A-656, A-659, A-660 | latch, A-639                                         |
| control unit equations, C-11             | Mean time to failure (MTTF), 432        | setup time, A-641, A-642                             |
| design, 260-263, B-1-79                  | improving, 433                          | SRAMs, A-646-650                                     |
| equations, A-595                         | versus AFR of disks, 433-434            | unclocked, A-639                                     |
| minimization, A-606                      |                                         |                                                      |
|                                          | Media Access Control (MAC) address,     | Memory hierarchies, 559                              |
| programmable array (PAL), A-666          |                                         | of ARM Cortex-A8, 482–488                            |

|  | block (or line), 390                | as abstract control representation, C-30 | Moore's law, 11, 393, 538, OL6.9-2, |
|--|-------------------------------------|------------------------------------------|-------------------------------------|
|  | cache performance, 412–431          |                                          |                                     |
|  | common framework, 465–472           |                                          |                                     |
|  | defined, 389                        |                                          |                                     |
|  | design challenges, 472              |                                          |                                     |
|  | development, OL5.17-6–5.17-8        |                                          |                                     |
|  | exploiting, 386–513                 | defined, A-600, C-20                     | MOVZ (move wide with zero), 64, 115 |
|  | of Intel Core i7, 482–488           | in PLA implementation, C-20              | MS-DOS, OL5.17-11                   |
|  | level pairs, 390                    | MIP-map, B-44                            | Multicore, 533–537                  |
|  | multiple levels, 389                | MIPS and ARMv8                           | Multicore multiprocessors, 8, 43    |
|  | overall operation of, 457–458       | common features between, 152             | defined, 8, 517                     |
|  | parallelism and, 477–481, OL5.11-2  | MIPS-16                                  | MULTICS (Multiplexed Information    |
|  | pitfalls, 491–495                   | 16-bit instruction set, D-41–42          | and Computing Service), OL5.17-     |
|  | program execution time and, 431     | immediate fields, D-41                   | 9–5.17-10                           |
| defined, 412, 430 reliance on, 390 MIPS-32 instruction set, 151 structure, 389 MIPS-64 instructions, 151, D-25-27 structure diagram, 392 variance, 431 virtual memory, 441-465  Memory rank, 395 Memory technologies, 392-397 disk memory, 395-397 DRAM technology, 392, 393-395 flash memory, 395-397 DRAM technology, 392, 393 Memory-mapped I/O, OL6.9-3 Memory-stall clock cycles, 413 Message passing defined, 543 multiprocessors, 543-548 Metastability, A-664 Methods defined, OL2.15-5 invoking in Java, OL2.15-20-2.15-21 Microarchitectures, 358 Intel Core i7 920, 358 Microarch Microarch dispatch ROMs, C-30-31 horizontal, C-32 design shift, 517 multicore, 8, 43, 517 Melos Core in Mircol and instruction set, 151 millicore, 8, 43, 517 MINPS-32 instruction set, 151 millicore, 8, 43, 517 MINPS-32 instruction set, 151 millicore, 8, 43, 517 MINPS-32 instruction set, 151 millicore constant shift amount, D-25 performance of, 424 summary, 431-432 Multimedia extensions desktop/server RISCs, D-16-18 as SIMD extensions to instruction sets, OL6.15-4 wellstansfers, D-25 Multiple dimension arrays, 226 Multiple instruction multiple data (MISD), 574 defined, 523, 524 first multiprocessor, OL6.15-14 Multiple instruction single data (MISD), 523 Multiple instruction single data (MISD), 523 Multiple instruction single data (MISD), 523 Multiple instruction single data (MISD), 524 defined, 523, 524 first multiprocessor, 343, 349-350 issue packets, 345 iopo unrolling and, 348 processors, 343, 344 static, 343, 345-349 Multiple processors, 569-571 Multiple-clock-cycle pipeline diagrams, 308 first acache, 467 dispatch ROMs, C-30-31 horizontal, C-32 miss sources, 471 indictory received in a department of the surface of the processor, 411 horizontal, C-32 multicore, 8, 43, 517 More machines, 475, A-656, A-659, defined, 528 forwarding, control values, 322                                               | quantitative design parameters, 466 | instructions, D-40-42                    | Multilevel caches. 
See also Caches  |
| reliance on, 390 MIPS-32 instruction set, 151 miss penalty, reducing, 424 structure, 389 MIPS-64 instructions, 151, D-25-27 structure diagram, 392 conditional procedure call instructions, 241 yerformance of, 424 summary, 431—432 summary, 431—43 | redundant arrays and inexpensive    | MIPS core instruction changes, D-42      | complications, 430                  |
| structure, 389 structure diagram, 392 variance, 431 virtual memory, 441–465 Memory technologies, 392–397 DRAM technology, 392, 393–395 flash memory, 395 SRAM technology, 392, 393 Memory technologies, 392–397 DRAM technology, 392, 393 SRAM technology, 392, 393 Memory stall clock cycles, 413 Memory stall clock cycles, 413 Memory-stall clock cycles, 413 Message passing defined, 543 multiprocessors, 543–548 Metastability, A-664 Methods defined, OL2.15-5 invoking in Java, OL2.15-20–2.15-21 Microarchitectures, 358 Intel Core ir 292, 358 Microcode assembler, C-30 control unit as, C-28 defined, C-27 dispatch ROMs, C-30–31 horizontal, C-32 vertical, C-32 design shift, 517 Miss under miss, 483 design shift, 517 Miss under miss, 483 design shift, 517 Miss under miss, 483 Miscrocre, 421 Microarchitectures, 431 Miss penalty defined, 390 d | disks, 481                          | PC-relative addressing, D-41             | defined, 412, 430                   |
| structure diagram, 392 variance, 431 variance, 431 virtual memory, 441–465  Memory rank, 395  Memory rank, 396  Memory technologies, 392–397 disk memory, 395–397 DRAM technology, 392, 393–395 flash memory, 395 SRAM technology, 392, 393  Memory-mapped I/O, Ol.6.9-3  Memory-stall clock cycles, 413 Message passing defined, 543 multiprocessors, 543–548 Mitoroing, Ol.5.11-5 Methods defined, Ol.2.15-5 invoking in Java, Ol.2.15-20–2.15-21 Microarchitectures, 358 Intel Core if 920, 358 Microcode assembler, C-30 defined, C-27 dispatch ROMs, C-30–31 Microinstructions, C-31 Microinstructions, C-31 Microinstruction, Sp. 27 Miss under miss, 443 Miss pash, 430 Multiple cimeral, 309 Multiple instruction multiple data (MISD), 574 Multiple instruction multiple data (MISD), 522 Multiple instruction multiple data (MISD), 574 Multiple instruction single data (MISD), 574 Multiple instruction single data (MISD), 574 Multiple instruction single data (MISD), 523 Multiple instruction single data (MISD), 524 Multiple instruction single data (MISD), 524 Multiple instruction single data (MISD), 523 Multiple instruction single data (MISD), 52 | reliance on, 390                    | MIPS-32 instruction set, 151             | miss penalty, reducing, 424         |
| variance, 431 virtual memory, 441–465 constant shift amount, D-25 Memory rank, 395 Memory technologies, 392–397 disk memory, 395–397 DRAM technology, 392, 393–395 flash memory, 395 SRAM technology, 392, 393 SRAM technology, 392, 393 Memory rank, 395 Memory technology, 392, 393–395 DRAM technology, 392, 393 SRAM technology, 392, 393 SRAM technology, 392, 393 Memory-stall clock cycles, 413 Memory-stall clock cycles, 413 D-27 Memory-stall clock cycles, 413 D-27 Message passing defined, 543 multiprocessors, 543–548 Mirroring, OL5.11-5 Methods defined, 543 Miss penalty Methods defined, 0L2.15-5 invoking in Java, OL2.15-20–2.15-21 Microarchitectures, 358 Intel Core i7 920, 358 Microcode assembler, C-30 defined, 390 Microarchitectures, 358 Intel Core i7 920, 358 Microorabictectures, 358 Intel Core i7 920, 358 Microorabic unit as, C-28 global, 430 Microorabic unit as, C-28 defined, C-27 improvement, 405–406 improvement, 405–406 dispatch ROMs, C-30–31 Intrinsity FastMATH processor, 411 horizontal, C-32 miss sources, 471 misr multiped acta extensions desktop/server RISCs, D-16–18 Miss under miss, 483 design shift, 517 Mix (MultiMedia eXtension), 232 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                        | structure, 389                      | MIPS-64 instructions, 151, D-25-27       | performance of, 424                 |
| | structure diagram, 392 | conditional procedure call instructions, D-27 | summary, 431–432 |
| | variance, 431 | | Multimedia extensions |
| | virtual memory, 441–465 | constant shift amount, D-25 | desktop/server RISCs, D-16–18 |
| | Memory rank, 395 | jump/call not PC-relative, D-26 | as SIMD extensions to instruction sets, OL6.15-4 |
| | Memory technologies, 392–397 | move to/from control registers, D-26 | |
| | disk memory, 395–397 | nonaligned data transfers, D-25 | vector versus, 525–526 |
| | DRAM technology, 392, 393–395 | NOR, D-25 | Multiple dimension arrays, 226 |
| | flash memory, 395 | parallel single precision floating-point operations, D-27 | Multiple instruction multiple data (MIMD), 574 |
| | SRAM technology, 392, 393 | | |
| | Memory-mapped I/O, OL6.9-3 | reciprocal and reciprocal square root, D-27 | defined, 523, 524 |
| | Memory-stall clock cycles, 413 | | first multiprocessor, OL6.15-14 |
| defined, 543TLB instructions, D-26-27523multiprocessors, 543-548Mirroring, OL5.11-5Multiple issue, 343-350Metastability, A-664Miss penaltycode scheduling, 347-348Methodsdefined, 390dynamic, 343, 349-350defined, OL2.15-5determination, 405-406issue packets, 345invoking in Java, OL2.15-20-2.15-21multilevel caches, reducing, 424loop unrolling and, 348Microarchitectures, 358Miss ratesprocessors, 343, 344Intel Core i7 920, 358block size versus, 406static, 343, 345-349Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569-571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405-406five instructions, 309dispatch ROMs, C-30-31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 |                                     | SYSCALL, D-25                            |                                     |
| Metastability, A-664Miss penaltycode scheduling, 347–348Methodsdefined, 390dynamic, 343, 349–350defined, OI2.15-5determination, 405–406issue packets, 345invoking in Java, OL2.15-20–2.15-21multilevel caches, reducing, 424loop unrolling and, 348Microarchitectures, 358Miss ratesprocessors, 343, 344Intel Core i7 920, 358block size versus, 406static, 343, 345–349Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569–571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405–406five instructions, 309dispatch ROMs, C-30–31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32local, 430Multiplexors, A-598vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 |                                     | TLB instructions, D-26-27                | 523                                 |
| Metastability, A-664Miss penaltycode scheduling, 347–348Methodsdefined, 390dynamic, 343, 349–350defined, OI2.15-5determination, 405–406issue packets, 345invoking in Java, OL2.15-20–2.15-21multilevel caches, reducing, 424loop unrolling and, 348Microarchitectures, 358Miss ratesprocessors, 343, 344Intel Core i7 920, 358block size versus, 406static, 343, 345–349Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569–571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405–406five instructions, 309dispatch ROMs, C-30–31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32local, 430Multiplexors, A-598vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                 | multiprocessors, 543-548            | Mirroring, OL5.11-5                      | Multiple issue, 343–350             |
| Methodsdefined, 390dynamic, 343, 349–350defined, OL2.15-5determination, 405–406issue packets, 345invoking in Java, OL2.15-20–2.15-21multilevel caches, reducing, 424loop unrolling and, 348Microarchitectures, 358Miss ratesprocessors, 343, 344Intel Core i7 920, 358block size versus, 406static, 343, 345–349Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569–571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405–406five instructions, 309dispatch ROMs, C-30–31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32local, 430Multiplexors, A-598vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 |                                     | Miss penalty                             |                                     |
| invoking in Java, OL2.15-20–2.15-21  Microarchitectures, 358  Miss rates  processors, 343, 344  Intel Core i7 920, 358  Microcode  data cache, 467  throughput and, 353  assembler, C-30  control unit as, C-28  defined, C-27  dispatch ROMs, C-30–31  horizontal, C-32  vertical, C-32  Microinstructions, C-31  Microprocessors  Miss under miss, 483  design shift, 517  multilevel caches, reducing, 424  loop unrolling and, 348  processors, 343, 344  static, 343, 345–349  throughput and, 353  Multiple processors, 569–571  Multiple-clock-cycle pipeline diagrams, 308  Multiple-clock-cycle pipeline diagrams, 308  five instructions, 309  Multiplexors, A-598  Multiplexors, A-598  Multiplexors, A-598  controls, 473  in datapath, 275  defined, 258  defined, 258  MMX (MultiMedia eXtension), 232  multicore, 8, 43, 517  Moore machines, 475, A-656, A-659,  selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | Methods                             | defined, 390                             | dynamic, 343, 349-350               |
| invoking in Java, OL2.15-20-2.15-21  Microarchitectures, 358  Miss rates  processors, 343, 344  Intel Core i7 920, 358  Microcode  data cache, 467  throughput and, 353  assembler, C-30  control unit as, C-28  defined, C-27  dispatch ROMs, C-30-31  horizontal, C-32  vertical, C-32  Microinstructions, C-31  Microprocessors  Miss under miss, 483  design shift, 517  multilevel caches, reducing, 424  loop unrolling and, 348  processors, 343, 344  static, 343, 345-349  throughput and, 353  Multiple processors, 569-571  Multiple-clock-cycle pipeline diagrams, 308  Multiple-clock-cycle pipeline diagrams, 308  five instructions, 309  Multiplexors, A-598  Multiplexors, A-598  Multiplexors, A-598  controls, 473  in datapath, 275  defined, 258  defined, 258  MMX (MultiMedia eXtension), 232  multicore, 8, 43, 517  Moore machines, 475, A-656, A-659,  selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | defined, OL2.15-5                   | determination, 405-406                   | issue packets, 345                  |
| Microarchitectures, 358  Miss rates  Intel Core i7 920, 358  block size versus, 406  static, 343, 345–349  Microcode  data cache, 467  throughput and, 353  assembler, C-30  control unit as, C-28  defined, 390  Multiple processors, 569–571  control unit as, C-28  defined, C-27  improvement, 405–406  five instructions, 309  dispatch ROMs, C-30–31  Intrinsity FastMATH processor, 411  horizontal, C-32  vertical, C-32  miss sources, 471  Microprocessors  Miss under miss, 483  design shift, 517  MMX (MultiMedia eXtension), 232  multicore, 8, 43, 517  Mioro processors, 343, 344  static, 343, 345–349  throughput and, 353  Multiple processors, 569–571  Multiple-clock-cycle pipeline diagrams, 308  five instructions, 309  five instructions, 349  five  | invoking in Java, OL2.15-20-2.15-21 | multilevel caches, reducing, 424         |                                     |
| Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569–571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405–406five instructions, 309dispatch ROMs, C-30–31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32local, 430Multiplexors, A-598vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | Microarchitectures, 358             | Miss rates                               |                                     |
| Microcodedata cache, 467throughput and, 353assembler, C-30defined, 390Multiple processors, 569–571control unit as, C-28global, 430Multiple-clock-cycle pipeline diagrams, 308defined, C-27improvement, 405–406five instructions, 309dispatch ROMs, C-30–31Intrinsity FastMATH processor, 411illustrated, 309horizontal, C-32local, 430Multiplexors, A-598vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
                                                 | Intel Core i7 920, 358              | block size versus, 406                   | static, 343, 345–349                |
| control unit as, C-28 global, 430 Multiple-clock-cycle pipeline diagrams, 308 defined, C-27 improvement, 405–406 five instructions, 309 dispatch ROMs, C-30–31 Intrinsity FastMATH processor, 411 illustrated, 309 Multiplexors, A-598 vertical, C-32 miss sources, 471 controls, 473 microinstructions, C-31 split cache, 411 in datapath, 275 defined, 258 design shift, 517 MMX (MultiMedia eXtension), 232 forwarding, control values, 322 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 |                                     | data cache, 467                          | throughput and, 353                 |
| control unit as, C-28 global, 430 Multiple-clock-cycle pipeline diagrams, 308 defined, C-27 improvement, 405–406 five instructions, 309 dispatch ROMs, C-30–31 Intrinsity FastMATH processor, 411 illustrated, 309 Multiplexors, A-598 vertical, C-32 miss sources, 471 controls, 473 microinstructions, C-31 split cache, 411 in datapath, 275 defined, 258 design shift, 517 MMX (MultiMedia eXtension), 232 forwarding, control values, 322 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 | assembler, C-30                     | defined, 390                             | Multiple processors, 569–571        |
| defined, C-27 improvement, 405–406 five instructions, 309 dispatch ROMs, C-30–31 Intrinsity FastMATH processor, 411 illustrated, 309 horizontal, C-32 local, 430 Multiplexors, A-598 vertical, C-32 miss sources, 471 controls, 473 dispatch ROMs, C-31 split cache, 411 in datapath, 275 defined, 258 design shift, 517 MMX (MultiMedia eXtension), 232 forwarding, control values, 322 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                 | control unit as, C-28               | global, 430                              |                                     |
| horizontal, C-32 local, 430 Multiplexors, A-598 vertical, C-32 miss sources, 471 controls, 473 Microinstructions, C-31 split cache, 411 in datapath, 275 Microprocessors Miss under miss, 483 defined, 258 design shift, 517 MMX (MultiMedia eXtension), 232 forwarding, control values, 322 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                 | defined, C-27                       | improvement, 405–406                     |                                     |
| horizontal, C-32 local, 430 Multiplexors, A-598 vertical, C-32 miss sources, 471 controls, 473 Microinstructions, C-31 split cache, 411 in datapath, 275 Microprocessors Miss under miss, 483 defined, 258 design shift, 517 MMX (MultiMedia eXtension), 232 forwarding, control values, 322 multicore, 8, 43, 517 Moore machines, 475, A-656, A-659, selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                 | dispatch ROMs, C-30-31              | Intrinsity FastMATH processor, 411       | illustrated, 309                    |
| vertical, C-32miss sources, 471controls, 473Microinstructions, C-31split cache, 411in datapath, 275MicroprocessorsMiss under miss, 483defined, 258design shift, 517MMX (MultiMedia eXtension), 232forwarding, control values, 322multicore, 8, 43, 517Moore machines, 475, A-656, A-659,selector control, 271                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                 |                                     | · · ·                                    |                                     |
| Microprocessors | vertical, C-32 | miss sources, 471 | in datapath, 275 |
| design shift, 517 | Microinstructions, C-31 | Miss under miss, 483 | defined, 258 |
| multicore, 8, 43, 517 | Microprograms | MMX (MultiMedia eXtension), 232 | forwarding, control values, 322 |
| | | Moore machines, 475, A-656, A-659, | selector control, 271 |

| Multiplicand, 192                    | fine-grained, 530                   | example, C-12-13                        |
|--------------------------------------|-------------------------------------|-----------------------------------------|
| Multiplication, 191–197. See also    | hardware, 530-533                   | implementation, C-12                    |
| Arithmetic                           | simultaneous (SMT), 531-533         | logic equations, C-12-13                |
| fast, hardware, 196                  | Must-information, OL2.15-5          | truth tables, C-15                      |
| faster, 196–197                      | Mutual exclusion, 125               | No Redundancy (RAID 0), OL5.11-4        |
| first algorithm, 194                 |                                     | No write allocation, 408                |
| floating-point, 215-217              | N                                   | Nonblocking assignment, A-612           |
| hardware, 192-194                    |                                     | Nonblocking caches, 355, 483            |
| instructions, 197                    | Name dependence, 348                | Nonuniform memory access (NUMA), 534    |
| in MIPS, 197                         | NAND gates, A-596                   | Nonvolatile memory, 22                  |
| multiplicand, 197                    | NAS (NASA Advanced Supercomputing), | Nops, 326                               |
| multiplier, 197                      | 556                                 | NOR gates, A-596                        |
| operands, 197                        | N-body                              | cross-coupled, A-638                    |
| product, 197                         | all-pairs algorithm, B-65           | D latch implemented with, A-640         |
| sequential version, 192–194          | GPU simulation, B-71                | NOR operation, D-25                     |
| signed, 196                          | mathematics, B-65-67                | NOT operation, 91, A-594                |
| Multiplier, 192                      | multiple threads per body, B-68-69  | Not-A-Number (NaN), 235–236             |
| Multiply algorithm, 195              | optimization, B-67                  | Numbers                                 |
| Multiply-add (MAD), B-42             | performance comparison, B-69–70     | binary, 75                              |
| Multiprocessors                      | results, B-70–72                    | computer versus real-world, 229         |
| benchmarks, 554–556                  | shared memory use, B-67-68          | decimal, 75, 78                         |
| bus-based coherent, OL6.15-7         | Negation shortcut, 79               | denormalized, 230                       |
| defined, 516                         | Nested procedures, 104–105          | hexadecimal, 84                         |
| historical perspective, 577          | compiling recursive procedure       | signed, 75-82                           |
| large-scale, OL6.15-7-6.15-8,        | showing, 104–105                    | unsigned, 75–82                         |
| OL6.15-9-6.15-10                     | NetFPGA 10-Gigabit Ethernet card,   | NVIDIA GeForce 8800, B-46–55            |
| message-passing, 545-550             | OL6.9-2, OL6.9-3                    | all-pairs N-body algorithm, B-71        |
| multithreaded architecture, B-26–27, | Network of Workstations, OL6.15-    | dense linear algebra computations,      |
| B-35–36                              | 8–6.15-9                            | B-51–53                                 |
| organization, 515, 545               | Network topologies, 550–553         | FFT performance, B-53                   |
| for performance, 573                 | implementing, 552                   | instruction set, B-49                   |
| shared memory, 517, 533–537          | multistage, 553                     | performance, B-51                       |
| software, 517                        | Networking, OL6.9-4                 | rasterization, B-50                     |
| TFLOPS, OL6.15-6                     | operating system in, OL6.9-4–6.9-5  | ROP, B-50–51                            |
| UMA, 534                             | performance improvement, OL6.9-     | scalability, B-51                       |
| Multistage networks, 551             | 7–6.9-10                            | sorting performance, B-54–55            |
| Multithreaded multiprocessor         | Networks, 23–24                     | special function approximation          |
| architecture, B-25–36                | advantages, 23                      | statistics, B-43                        |
| conclusion, B-36                     | bandwidth, 549                      | special function unit (SFU), B-50       |
| ISA, B-31-34                         | crossbar, 551                       | streaming multiprocessor (SM), B-48–49  |
| massive multithreading, B-25–26      | fully connected, 551                | streaming processor, B-49–50            |
| multiprocessor, B-26–27              | local area (LANs), 24               | streaming processor array (SPA), B-46   |
| multiprocessor comparison,           | multistage, 551                     | texture/processor cluster (TPC),        |
| B-35–36                              | wide area (WANs), 24                | B-47-48                                 |
| SIMT, B-27–30                        | Newton's iteration, 226             | NVIDIA GPU architecture, 539–541        |
| special function units (SFUs), B-35  | Next state                          | NVIDIA GTX 280, 565, 566                |
| streaming processor (SP), B-34       | nonsequential, C-24                 | NVIDIA Tesla GPU, 564–569               |
| thread instructions, B-30–31         | sequential, C-23                    |                                         |
| threads/thread blocks management,    | Next-state function, 474, A-655     | O                                       |
| B-30                                 | defined, 474                        |                                         |
| Multithreading, B-25–26              | implementing, with sequencer,       | Object files, 132                       |
| coarse-grained, 530                  | C-22–28                             | debugging information, 131              |
| defined, 522                         | Next-state outputs, C-10, C-12-13   | header, 130                             |

| linking, 132–134                         | defined, 76, 206                          | surfaces, B-41                        |
|------------------------------------------|-------------------------------------------|---------------------------------------|
| relocation information, 130              | detection, 190                            | texture memory, B-40                  |
| static data segment, 130                 | exceptions, 339                           | Parallel processing programs, 518–523 |
| symbol table, 130                        | floating-point, 207                       | creation difficulty, 518–523          |
| text segment, 130                        | occurrence, 77                            | defined, 516                          |
| Object-oriented languages. See also Java | saturation and, 191                       | for message passing, 533              |
| brief history, OL2.22-8                  | subtraction, 189                          | great debates in, OL6.15-5            |
| defined, 150, OL2.15-5                   |                                           | for shared address space, 533–534     |
| One's complement, 82, A-617              | P                                         | use of, 573                           |
| Opcodes                                  |                                           | Parallel reduction, B-62              |
| control line setting and, 276            |                                           | Parallel scan, B-60–63                |
| defined, 84, 274                         | P+Q redundancy (RAID 6), OL5.11-7         |                                       |
|                                          | Packed floating-point format, 232         | CUDA template, B-61                   |
| OpenGL, B-13                             | Page faults, 448. See also Virtual memory | inclusive, B-60                       |
| OpenMP (Open MultiProcessing), 536,      | for data access, 463                      | tree-based, B-62                      |
| 556                                      | defined, 442                              | Parallel software, 517                |
| Operands, 67–72. See also Instructions   | handling, 443, 461–464                    | Parallelism, 12, 43, 342–355          |
| 32-bit immediate, 115–116                | virtual address causing, 457, 458         | and computer arithmetic, 230–232      |
| adding, 189                              | Page tables, 468                          | data-level, 246, 524                  |
| arithmetic instructions, 67              | defined, 446                              | debates, OL6.15-5-6.15-7              |
| compiling assignment when in             | illustrated, 449                          | GPUs and, 538, B-76                   |
| memory, 69                               | indexing, 446                             | instruction-level, 43, 342, 354       |
| constant, 73–74                          | inverted, 451                             | memory hierarchies and, 477–481,      |
| division, 197                            | levels, 451                               | OL5.11-2                              |
| floating-point, 221                      | main memory, 451                          | multicore and, 533                    |
| LEGv8, 64                                | register, 446                             | multiple issue, 342–349               |
| memory, 68–69                            | storage reduction techniques, 451         | multithreading and, 531               |
| multiplication, 191                      | updating, 446                             | performance benefits, 44              |
| Operating systems                        | VMM, 463                                  | process-level, 516                    |
| brief history, OL5.17-9-5.17-12          | Pages. See also Virtual memory            | redundant arrays of inexpensive       |
| defined, 13                              | defined, 442                              | disks, 481                            |
| encapsulation, 22                        | dirty, 452                                | subword, D-17                         |
| in networking, OL6.9-4-6.9-5             | finding, 446–447                          | task, B-24                            |
| Operations                               | LRU, 448                                  | task-level, 516                       |
| atomic, implementing, 126                | offset, 443                               | thread, B-22                          |
| hardware, 63–67                          | physical number, 443                      | Paravirtualization, 495               |
| logical, 90–93                           | placing, 432–434                          | PA-RISC, D-14, D-17                   |
| x86 integer, 157, 158                    | size, 444                                 | branch vectored, D-35                 |
| Optimization                             | virtual number, 443                       | conditional branches, D-34, D-35      |
| class explanation, OL2.15-14             | Parallel bus, OL6.9-3                     | debug instructions, D-36              |
| compiler, 146                            | Parallel execution, 125                   | decimal operations, D-35              |
| control implementation,                  | Parallel memory system, B-36-41.          | extract and deposit, D-35             |
| C-27-28                                  | See also Graphics processing units        | instructions, D-34-36                 |
| global, OL2.15-5                         | (GPUs)                                    | load and clear instructions, D-36     |
| high-level, OL2.15-4-2.15-5              | caches, B-38                              | multiply/add and multiply/subtract,   |
| local, OL2.15-5, OL2.15-8                | constant memory, B-40                     | D-36                                  |
| manual, 150                              | DRAM considerations, B-37-38              | nullification, D-34                   |
| OR operation, 91, A-594                  | global memory, B-39                       | nullifying branch option, D-25        |
| Out-of-order execution                   | load/store access, B-41                   | store bytes short, D-36               |
| defined, 351                             | local memory, B-40                        | synthesized multiply and divide,      |
| performance complexity, 430              | memory spaces, B-39                       | D-34-35                               |
| processors, 355                          | MMU, B-38-39                              | Parity, OL5.11-5                      |
| Output devices, 16                       | ROP, B-41                                 | bits, 435                             |
| Overflow                                 | shared memory, B-39-40                    | code, 434, A-653                      |

| PARSEC (Princeton Application         | with control signals, 311-315          | ignoring memory system behavior, 491   |
|---------------------------------------|----------------------------------------|----------------------------------------|
| Repository for Shared Memory          | corrected, 307                         | memory hierarchies, 491–495            |
| Computers), 556                       | illustrated, 300                       | out-of-order processor evaluation, 492 |
| Pass transistor, A-651                | in load instruction stages, 307        | performance equation subset, 50-51     |
| PCI-Express (PCIe), 553, B-8, OL6.9-2 | Pipelined dependencies, 317            | pipelining, 366–367                    |
| PC-relative addressing, 118, 120      | Pipelines                              | pointer to automatic variables, 171    |
| Peak floating-point performance, 558  | branch instruction impact, 329         | sequential word addresses, 171         |
| Pentium bug morality play, 244        | effectiveness, improving, OL4.16-      | simulating cache, 491                  |
| Performance, 28–40                    | 4-4.16-5                               | software development with              |
| assessing, 28                         | execute and address calculation stage, | multiprocessors, 570                   |
| classic CPU equation, 36-40           | 301, 303                               | VMM implementation, 495                |
| components, 38                        | five-stage, 285, 301, 309              | Pixel shader example, B-15-17          |
| CPU, 33–35                            | graphic representation, 290, 307-311   | Pixels, 18                             |
| defining, 29-32                       | instruction decode and register file   | Pointers                               |
| equation, using, 36                   | read stage, 300, 303                   | arrays versus, 146-150                 |
| improving, 34–35                      | instruction fetch stage, 301, 303      | frame, 106                             |
| instruction, 35–36                    | instruction sequence, 325              | global, 106                            |
| measuring, 33-35, OL1.12-10           | latency, 297                           | incrementing, 148                      |
| program, 39–40                        | memory access stage, 301, 303          | Java, OL2.15-26                        |
| ratio, 31                             | multiple-clock-cycle diagrams, 308     | stack, 101, 105                        |
| relative, 31–32                       | performance bottlenecks, 353           | Polling, OL6.9-8                       |
| response time, 30–31                  | single-clock-cycle diagrams, 308       | Pop, 101                               |
| sorting, B-54–55                      | stages, 285                            | Power                                  |
| throughput, 30–31                     | static two-issue, 345                  | clock rate and, 40                     |
| time measurement, 32                  | write-back stage, 301, 305             | critical nature of, 53                 |
| Personal computers (PCs), 7           | Pipelining, 12, 283–297                | efficiency, 354-355                    |
| defined, 5                            | advanced, 354-355                      | relative, 41                           |
| Personal mobile device (PMD)          | benefits, 283                          | PowerPC                                |
| defined, 7                            | control hazards, 292-293               | algebraic right shift, D-33            |
| Petabyte, 6                           | data hazards, 289                      | branch registers, D-32-33              |
| Physical addresses, 442               | exceptions and, 338-342                | condition codes, D-12                  |
| mapping to, 442–443                   | execution time and, 297                | instructions, D-12-13                  |
| space, 533, 535                       | fallacies, 366-367                     | instructions unique to, D-31-33        |
| Physically addressed caches, 458      | hazards, 288                           | load multiple/store multiple, D-33     |
| Pipeline registers                    | instruction set design for, 288        | logical shifted immediate, D-33        |
| before forwarding, 320                | laundry analogy, 284                   | rotate with mask, D-33                 |
| dependences, 319                      | overview, 283-297                      | Precise interrupts, 342                |
| forwarding unit selection, 323        | paradox, 285                           | Prediction, 12                         |
| Pipeline stalls, 291                  | performance improvement, 288           | 2-bit scheme, 333                      |
| avoiding with code reordering, 291    | pitfall, 366–367                       | accuracy, 333                          |
| data hazards and, 324-328             | simultaneous executing instructions,   | dynamic branch, 331-333                |
| insertion, 326                        | 297                                    | loops and, 333-334                     |
| load-use, 329                         | speed-up formula, 285                  | steady-state, 333                      |
| as solution to control hazards, 293   | structural hazards, 288, 305           | Prefetching, 496, 560                  |
| Pipelined branches, 331               | summary, 296                           | Primitive types, OL2.15-26             |
| Pipelined control, 311-315. See also  | throughput and, 297                    | Procedure calls                        |
| Control                               | Pitfalls. See also Fallacies           | preservation across, 106               |
| control lines, 311, 312               | address space extension, 493           | Procedures, 100-110                    |
| overview illustration, 327            | arithmetic, 242-245                    | compiling, 102                         |
| specifying, 312                       | associativity, 492                     | compiling, showing nested procedure    |
| Pipelined datapaths, 297-315          | defined, 49                            | linking, 102–104                       |
| with connected control signals, 315   | GPUs, B-74-75                          | execution steps, 100                   |

| frames, 106                           | Programmable logic devices (PLDs),       | RAM, 9                                  |
|---------------------------------------|------------------------------------------|-----------------------------------------|
| leaf, 104                             | A-666                                    | Raster operation (ROP) processors, B-12, |
| nested, 104-106                       | Programmable ROMs (PROMs), A-602         | B-41, B-50-51                           |
| recursive, 108                        | Programming languages. See also specific | fixed function, B-41                    |
| for setting arrays to zero, 147       | languages                                | Raster refresh buffer, 18               |
| sort, 140–145                         | brief history of, OL2.22-7-2.22-8        | Rasterization, B-50                     |
| strcpy, 112-113                       | object-oriented, 150                     | Ray casting (RC), 568                   |
| string copy, 112–113                  | variables, 67                            | Read-only memories (ROMs),              |
| swap, 138                             | Programs                                 | A-602-604                               |
| Process identifiers, 460              | assembly language, 129                   | control entries, C-16-17                |
| Process-level parallelism, 516        | Java, starting, 136–137                  | control function encoding, C-18-19      |
| Processors, 254–368                   | parallel processing, 516-523             | dispatch, C-25                          |
| as cores, 43                          | starting, 128–137                        | implementation, C-15-19                 |
| control, 19                           | translating, 128–137                     | logic function encoding, A-603          |
| datapath, 19                          | Propagate                                | overhead, C-18                          |
| defined, 17, 19                       | defined, A-628                           | PLAs and, A-603-604                     |
| dynamic multiple-issue, 343           | example, A-632                           | programmable (PROM), A-602              |
| multiple-issue, 343                   | super, A-629                             | total size, C-16                        |
| out-of-order execution, 355, 430      | Protected keywords, OL2.15-21            | Read-stall cycles, 413                  |
| performance growth, 44                | Protection                               | Read-write head, 395                    |
| ROP, B-12, B-41                       | defined, 442                             | Receive message routine, 545            |
| speculation, 344–345                  | implementing, 459-460                    | Recursive procedures, 108. See also     |
| static multiple-issue, 343, 345-349   | mechanisms, OL5.17-9                     | Procedures                              |
| streaming, B-34                       | VMs for, 438                             | clone invocation, 104                   |
| superscalar, 349, 531-532, OL4.16-5   | Protection group, OL5.11-5               | Reduced instruction set computer        |
| technologies for building, 24-28      | Pseudo MIPS                              | (RISC) architectures, D-2-45,           |
| two-issue, 346                        | defined, 246                             | OL2.22-5, OL4.16-4. See also            |
| vector, 523-524                       | instruction set, 248                     | Desktop and server RISCs;               |
| VLIW, 345                             | Pseudoinstructions                       | Embedded RISCs                          |
| Product, 192                          | defined, 129                             | group types, D-3-4                      |
| Product of sums, A-599                | summary, 130                             | instruction set lineage, D-44           |
| Program counters (PCs), 263           | Pthreads (POSIX threads), 556            | Reduction, 535                          |
| changing with conditional branch,     | PTX instructions, B-31, B-32             | Redundant arrays of inexpensive disks   |
| 334                                   | Public keywords, OL2.15-21               | (RAID), OL5.11-2-5.11-8                 |
| defined, 101, 263                     | Push                                     | history, OL5.11-8                       |
| exception, 459, 461                   | defined, 101                             | RAID 0, OL5.11-4                        |
| incrementing, 263, 265                | using, 104                               | RAID 1, OL5.11-5                        |
| instruction updates, 300              |                                          | RAID 2, OL5.11-5                        |
| Program performance                   | Q                                        | RAID 3, OL5.11-5                        |
| elements affecting, 39                |                                          | RAID 4, OL5.11-5-5.11-6                 |
| understanding, 9                      | Quad words, 158                          | RAID 5, OL5.11-6-5.11-7                 |
| Programmable array logic (PAL), A-666 | Quicksort, 425, 426                      | RAID 6, OL5.11-7                        |
| Programmable logic arrays (PLAs)      | Quotient, 198                            | spread of, OL5.11-6                     |
| component dots illustration, A-604    |                                          | summary, OL5.11-7-5.11-8                |
| control function implementation, C-7, | R                                        | use statistics, OL5.11-7                |
| C-20-21                               |                                          | Reference bit, 450                      |
| defined, A-600                        | Race, A-661                              | References                              |
| example, A-601-602                    | Radix sort, 425, 426, B-63–65            | absolute, 131                           |
| illustrated, A-601                    | CUDA code, B-64                          | types, OL2.15-26                        |
| ROMs and, A-603-604                   | implementation, B-63-65                  | Register 31, 74, 102, 175               |
| size, C-20                            | RAID. See Redundant arrays of            | Register addressing, 120                |
| truth table implementation, A-601     | inexpensive disks (RAID)                 | Register allocation, OL2.15-11-2.15-13  |

| Register files, A-638, A-642–644                               | Roofline model, 558–559, 560, 561                    | Set instructions, 97                                                 |
|----------------------------------------------------------------|------------------------------------------------------|----------------------------------------------------------------------|
| defined, 264, A-638, A-642                                     | with ceilings, 561                                   | Set-associative caches, 417. See also                                |
| in behavioral Verilog, A-645                                   | computational roofline, 559                          | Caches                                                               |
| single, 269                                                    | illustrated, 557                                     | address portions, 421                                                |
| two read ports implementation, A-643                           | Opteron generations, 558                             | block replacement strategies, 468                                    |
| with two read ports/one write port,                            | with overlapping areas shaded, 563                   | choice of, 467                                                       |
| A-643                                                          | peak floating-point performance, 562                 | four-way, 418, 421                                                   |
| write port implementation, A-644                               | peak memory performance, 562                         | memory-block location, 417                                           |
| Register-memory architecture, OL2.22-3                         | with two kernels, 563                                | misses, 419-420                                                      |
| Registers, 156, 157–158                                        | Rotational delay. See Rotational latency             | n-way, 417                                                           |
| architectural, 336–342                                         | Rotational latency, 397                              | two-way, 418                                                         |
| base, 69                                                       | Rounding, 226                                        | Setup time, A-641, A-642                                             |
| clock cycle time and, 67                                       | accurate, 226                                        | Shaders                                                              |
| compiling C assignment with, 68                                | bits, 228                                            | defined, B-14                                                        |
| defined, 67                                                    | with guard digits, 227                               | floating-point arithmetic, B-14                                      |
| destination, 85, 274                                           | IEEE 754 modes, 227                                  | graphics, B-14–15                                                    |
| floating-point, 226                                            | Row-major order, 225, 427                            | pixel example, B-15-17                                               |
| left half, 301                                                 | R-type instructions, 264                             | Shading languages, B-14                                              |
| LEGv8 conventions, 108                                         | datapath for, 276                                    | Shadowing, OL5.11-5                                                  |
| mapping, 82                                                    | datapath in operation for, 278                       | Shared memory. See also Memory                                       |
| number specification, 264                                      |                                                      | as low-latency memory, B-21                                          |
| page table, 446                                                | S                                                    | caching in, B-58-60                                                  |
| pipeline, 319, 321, 323                                        |                                                      | CUDA, B-58                                                           |
| primitives, 67                                                 | Saturation, 191                                      | N-body and, B-67-68                                                  |
| renaming, 348                                                  | SCALAPAK, 244                                        | per-CTA, B-39                                                        |
| right half, 301                                                | Scaling                                              | SRAM banks, B-40                                                     |
| spilling, 72                                                   | strong, 521                                          | Shared memory multiprocessors (SMP),                                 |
| Status, 337                                                    | weak, 521                                            | 531–535                                                              |
| temporary, 68, 102                                             | Scientific notation                                  | defined, 517, 531                                                    |
| variables, 67                                                  | adding numbers in, 213                               | single physical address space, 531                                   |
| Relative performance, 31-32                                    | defined, 205                                         | synchronization, 534                                                 |
| Relative power, 41                                             | for reals, 205                                       | Shift amount, 84                                                     |
| Reliability, 432                                               | Search engines, 4                                    | Shift instructions, 90                                               |
| Remainder                                                      | Secondary memory, 23                                 | Sign and magnitude, 206                                              |
| defined, 198                                                   | Sectors, 395                                         | Sign bit, 78                                                         |
| Reorder buffers, 355                                           | Secure Hash Algorithm (SHA)                          | Sign extension, 266                                                  |
| Replication, 479                                               | encryption, 488                                      | defined, 78                                                          |
| Requested word first, 406                                      | Seek, 396                                            | shortcut, 80                                                         |
| Request-level parallelism, 548                                 | Segmentation, 445                                    | Signals                                                              |
| Reservation stations                                           | Selector values, A-598                               | asserted, 262, A-592                                                 |
| buffering operands in, 350                                     | Semiconductors, 25                                   | control, 262, 274-275                                                |
| defined, 350                                                   | Send message routine, 545                            | deasserted, 262, A-592                                               |
| Response time, 30–31                                           | Sensitivity list, A-612                              | Signed division, 201-202                                             |
| Restartable instructions, 462                                  | Sequencers                                           | Signed multiplication, 196                                           |
| Return address, 100                                            | explicit, C-32                                       | Signed numbers, 75–82                                                |
| Return from exception (ERET), 459                              | implementing next-state function with,               | sign and magnitude, 77                                               |
| R-format, 274                                                  | C-22-28                                              | treating as unsigned, 98                                             |
| ALU operations, 265                                            | Sequential logic, A-593                              | Significands, 207                                                    |
| defined, 86                                                    | Servers, OL5. See also Desktop and server            | addition, 212                                                        |
| Ripple carry                                                   | RISCs                                                | multiplication, 215                                                  |
| adder, A-617                                                   | cost and capability, 5                               | Silicon, 25                                                          |
| carry lookahead speed versus,                                  | Service accomplishment, 432                          | as key hardware technology, 53                                       |
| A-634–635                                                      | Service interruption, 432                            | crystal ingot, 26                                                    |

| defined, 26                                | Smalltalk-80, OL2.22-8                | SPEC2000, OL1.12-12                  |
|--------------------------------------------|---------------------------------------|--------------------------------------|
| wafers, 26                                 | Smart phones, 7                       | SPEC2006, 246, OL1.12-12             |
| Silicon crystal ingot, 26                  | Snooping protocol, 479–481            | SPEC89, OL1.12-11                    |
| SIMD (Single Instruction Multiple Data),   | Snoopy cache coherence, OL5.12-7      | SPEC92, OL1.12-12                    |
| 522, 574                                   | Software optimization                 | SPEC95, OL1.12-12                    |
| computers, OL6.15-2-6.15-4                 | via blocking, 427-432                 | SPECrate, 554                        |
| data vector, B-35                          | Sort algorithms, 146                  | SPECratio, 47                        |
| extensions, OL6.15-4                       | Software                              | Special function units (SFUs), B-35, |
| for loops and, OL6.15-3                    | layers, 13                            | B-50                                 |
| massively parallel multiprocessors,        | multiprocessor, 516                   | defined, B-43                        |
| OL6.15-2                                   | parallel, 517                         | Speculation, 344-345                 |
| small-scale, OL6.15-4                      | as service, 7, 547, 574               | hardware-based, 352                  |
| vector architecture, 524-525               | systems, 13                           | implementation, 344                  |
| in x86, 524                                | Sort procedure, 140–144. See also     | performance and, 344                 |
| SIMMs (single inline memory modules),      | Procedures                            | problems, 344                        |
| OL5.17-5, OL5.17-6                         | code for body, 140-142                | recovery mechanism, 344              |
| Simple programmable logic devices          | full procedure, 143-144               | Speed-up challenge, 518              |
| (SPLDs), A-666                             | passing parameters in, 143            | balancing load, 518-519              |
| Simplicity, 171                            | preserving registers in, 143          | bigger problem, 520–521              |
| Simultaneous multithreading (SMT),         | procedure call, 143                   | Spilling registers, 72, 101          |
| 531-533                                    | register allocation for, 140          | Split algorithm, 568                 |
| support, 531                               | Sorting performance, B-54–55          | Split caches, 411                    |
| thread-level parallelism, 531              | Space allocation                      | Stack architectures, OL2.22-4        |
| unused issue slots, 531                    | on heap, 107–110                      | Stack pointers                       |
| Single error correcting/Double error       | on stack, 106                         | adjustment, 104                      |
| detecting (SEC/DED), 434–436               | SPARC                                 | defined, 101                         |
| Single instruction single data (SISD), 523 | annulling branch, D-23                | values, 104                          |
| Single precision. See also Double          | CASA, D-31                            | Stacks                               |
| precision                                  | conditional branches, D-10-12         | allocating space on, 106             |
| binary representation, 210                 | fast traps, D-30                      | for arguments, 145                   |
| defined, 207                               | floating-point operations, D-31       | defined, 101                         |
| Single-clock-cycle pipeline diagrams, 308  | instructions, D-29-32                 | pop, 101                             |
| illustrated, 310                           | least significant bits, D-31          | push, 101, 104                       |
| Single-cycle datapaths. See also Datapaths | multiple precision floating-point     | Stalls, 291                          |
| illustrated, 298                           | results, D-32                         | as solution to control hazard, 292   |
| instruction execution, 299                 | nonfaulting loads, D-32               | avoiding with code reordering, 291   |
| Single-cycle implementation                | overlapping integer operations, D-31  | behavioral Verilog with detection,   |
| control function for, 281                  | quadruple precision floating-point    | OL4.13-6                             |
| defined, 281                               | arithmetic, D-32                      | data hazards and, 324-328            |
| nonpipelined execution versus              | register windows, D-29-30             | illustrations, OL4.13-23, OL4.13-30  |
| pipelined execution, 287                   | support for LISP and Smalltalk, D-30  | insertion into pipeline, 326         |
| non-use of, 284                            | Sparse matrices, B-55–58              | load-use, 329                        |
| penalty, 283                               | Sparse Matrix-Vector multiply (SpMV), | memory, 414                          |
| pipelined performance versus, 285          | B-55, B-57, B-58                      | write-back scheme, 413               |
| Single-instruction multiple-thread         | CUDA version, B-57                    | write buffer, 413                    |
| (SIMT), B-27–30                            | serial code, B-57                     | Standby spares, OL5.11-8             |
| overhead, B-35                             | shared memory version, B-59           | State                                |
| multithreaded warp scheduling, B-28        | Spatial locality, 388                 | in 2-bit prediction scheme, 333      |
| processor architecture, B-28               | large block exploitation of, 405      | assignment, A-658, C-27              |
| warp execution and divergence,             | tendency, 392                         | bits, C-8                            |
| B-29-30                                    | SPEC, OL1.12-11-1.12-12               | exception, saving/restoring, 462     |
| Single-program multiple data (SPMD),       | CPU benchmark, 46–48                  | logic components, 261                |
| B-22                                       | power benchmark, 48-49                | specification of, 446                |

| State elements                         | Stream benchmark, 564               | Symbol tables, 130            |
|----------------------------------------|-------------------------------------|-------------------------------|
| clock and, 262                         | Streaming multiprocessor (SM),      | Synchronization, 125–127, 568 |
| combinational logic and, 262           | B-48-49                             | barrier, B-18, B-20, B-34     |
| defined, 260, A-636                    | Streaming processors, B-34, B-49-50 | defined, 534                  |
| inputs, 261                            | array (SPA), B-41, B-46             | lock, 125                     |
| in storing/accessing instructions, 264 | Streaming SIMD Extension 2 (SSE2)   | overhead, reducing, 44-45     |
| register file, A-638                   | floating-point architecture, 232    | unlock, 125                   |
| Static branch prediction, 345          | Streaming SIMD Extensions (SSE) and | Synchronizers                 |
| Static data                                                                                                                                                                                                                                                                                                                                                                                                                                  | advanced vector extensions in x86,                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | defined, A-664                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| segment, 107                                                                                                                                                                                                                                                                                                                                                                                                                                 | 232–233                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 | failure, A-665                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Static multiple-issue processors, 343,                                                                                                                                                                                                                                                                                                                                                                                                       | Stretch computer, OL4.16-2                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | from D flip-flop, A-664                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 345–349. See also Multiple issue | Strings | Synchronous DRAM (SDRAM), 393-394, |
| control hazards and, 345                                                                                                                                                                                                                                                                                                                                                                                                                     | defined, 111                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | A-648, A-653                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| instruction sets, 345                                                                                                                                                                                                                                                                                                                                                                                                                        | in Java, 113–115                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Synchronous SRAM (SSRAM), A-648                                                                                                                                                                                                                                                                                                                                                                                                                    |
| with LEGv8 ISA, 345-348                                                                                                                                                                                                                                                                                                                                                                                                                      | representation, 111                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | Synchronous system, A-636                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Static random access memories (SRAMs),                                                                                                                                                                                                                                                                                                                                                                                                       | Strip mining, 526                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | Syntax tree, OL2.15-3                                                                                                                                                                                                                                                                                                                                                                                                                              |
| 392, 393, A-646-650                                                                                                                                                                                                                                                                                                                                                                                                                          | Striping, OL5.11-4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | System calls                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| array organization, A-650                                                                                                                                                                                                                                                                                                                                                                                                                    | Strong scaling, 521                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | defined, 459                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| basic structure, A-649                                                                                                                                                                                                                                                                                                                                                                                                                       | Structural hazards, 288–289, 305                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | Systems software, 13                                                                                                                                                                                                                                                                                                                                                                                                                               |
| defined, 21, A-646                                                                                                                                                                                                                                                                                                                                                                                                                           | STUR (store register), 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                               | SystemVerilog                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| fixed access time, A-646                                                                                                                                                                                                                                                                                                                                                                                                                     | STURB (store byte), 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | cache controller, OL5.12-2                                                                                                                                                                                                                                                                                                                                                                                                                         |
| large, A-647                                                                                                                                                                                                                                                                                                                                                                                                                                 | STURH (store half), 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | cache data and tag modules, OL5.12-6                                                                                                                                                                                                                                                                                                                                                                                                               |
| read/write initiation, A-647                                                                                                                                                                                                                                                                                                                                                                                                                 | STURW (store word), 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | FSM, OL5.12-7                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| synchronous (SSRAMs), A-648                                                                                                                                                                                                                                                                                                                                                                                                                  | STXR (store exclusive register), 64, 126                                                                                                                                                                                                                                                                                                                                                                                                                                                                | simple cache block diagram, OL5.12-4                                                                                                                                                                                                                                                                                                                                                                                                               |
| three-state buffers, A-647, A-648                                                                                                                                                                                                                                                                                                                                                                                                            | SUB (subtract), 64                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      | type declarations, OL5.12-2                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Static variables, 106 | SUBI (subtract immediate), 64 | |
| Steady-state prediction, 333                                                                                                                                                                                                                                                                                                                                                                                                                 | SUBIS (subtract immediate and set flags),                                                                                                                                                                                                                                                                                                                                                                                                                                                               | T                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|                                                                                                                                                                                                                                                                                                                                                                                                                                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| *                                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | -                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| Sticky bits, 228                                | 64                                           | Tablets, 7                           |
| Store buffers, 355                              | SUBS (subtract and set flags), 64            | Tags                                 |
| Store instructions. See also Load instructions  | Subnormals, 230                              | defined, 398                         |
| access, B-41                                    | Subtraction, 188–191. See also Arithmetic    | in locating block, 421               |
| base register, 274                              | binary, 188–189                              | page tables and, 448                 |
| compiling with, 71                              | floating-point, 220                          | size of, 423                         |
| conditional, 126                                | negative number, 190                         | Tail call, 109                       |
| defined, 71                                     | overflow, 190                                | Task identifiers, 460                |
| EX stage, 305                                   | Subword parallelism, 230–232, 365, D-17      | Task parallelism, B-24               |
| ID stage, 302                                   | and matrix multiply, 238–242                 | Task-level parallelism, 516          |
| IF stage, 302                                   | Sum of products, A-599, A-600                | Tebibyte (TiB), 5                    |
| instruction dependency, 323                     | Supercomputers, OL4.16-3                     | Tesla PTX ISA, B-31–34               |
| MEM stage, 304                                  | defined, 5                                   | arithmetic instructions, B-33        |
| unit for implementing, 267                      | SuperH, D-15, D-39–40                        | barrier synchronization, B-34        |
| WB stage, 304                                   | Superscalars                                 | GPU thread instructions, B-32        |
| Store register, 72                              | defined, 349, OL4.16-5                       | memory access instructions, B-33–34  |
| Stored program concept, 63                      | dynamic pipeline scheduling, 349             | Temporal locality, 388               |
| as computer principle, 88                       | multithreading options, 516                  | tendency, 392                        |
| illustrated, 89                                 | Surfaces, B-41                               | Temporary registers, 68, 102         |
| principles, 171                                 | Swap procedure, 138. See also Procedures     | Terabyte (TB), 6                     |
| Strcpy procedure, 112–113. See also             | body code, 138                               | defined, 5                           |

| Thrashing, 464                          | Tree-based parallel scan, B-62         | Use latency                              |
|-----------------------------------------|----------------------------------------|------------------------------------------|
| Thread blocks, 542                      | Truth tables, A-593                    | defined, 346                             |
| creation, B-23                          | ALU control lines, C-5                 | one-instruction, 346                     |
| defined, B-19                           | for control bits, 272                  |                                          |
| managing, B-30                          | datapath control outputs, C-17         | V                                        |
| memory sharing, B-20                    | datapath control signals, C-14         |                                          |
| synchronization, B-20                   | defined, 272                           | Vacuum tubes, 25                         |
| Thread parallelism, B-22                | example, A-593                         | Valid bit, 400                           |
| Threads                                 | next-state output bits, C-15           | Variables                                |
| creation, B-23                          | PLA implementation, A-601              | C language, 106                          |
| CUDA, B-36                              | Two's complement representation, 77-78 | programming language, 67                 |
| ISA, B-31-34                            | advantage, 78                          | register, 67                             |
| managing, B-30                          | negation shortcut, 79-80               | static, 106                              |
| memory latencies and, B-74-75           | rule, 81                               | storage class, 106                       |
| multiple, per body, B-68-69             | sign extension shortcut, 80-81         | type, 106                                |
| warps, B-27                             | Two-level logic, A-599-602             | VAX architecture, OL2.22-4, OL5.17-7     |
| Three Cs model, 459-461                 | Two-phase clocking, A-663              | Vector lanes, 526                        |
| Three-state buffers, A-647, A-648       | TX-2 computer, OL6.15-4                | Vector processors, 523-524. See also     |
| Throughput                              |                                        | Processors                               |
| defined, 30-31                          | U                                      | conventional code comparison,            |
| multiple issue and, 342                 |                                        | 525–526                                  |
| pipelining and, 286, 342                | Unconditional branches, 94             | instructions, 524                        |
| Thumb, D-15, D-38                       | Underflow, 206                         | multimedia extensions and, 524-525       |
| Timing                                  | Unicode                                | scalar versus, 526-527                   |
| asynchronous inputs, A-664-665          | alphabets, 113                         | Vectored interrupts, 337                 |
| level-sensitive, A-663-664              | defined, 113                           | Verilog                                  |
| methodologies, A-660-665                | example alphabets, 114                 | behavioral definition of MIPS ALU,       |
| two-phase, A-663                        | Unified GPU architecture, B-10-12      | A-613                                    |
| TLB misses, 453. See also Translation-  | illustrated, B-11                      | behavioral definition with bypassing,    |
| lookaside buffer (TLB)                  | processor array, B-11-12               | OL4.13-4-4.13-6                          |
| handling, 461-465                       | Uniform memory access (UMA), 534,      | behavioral definition with stalls for    |
| occurrence, 461                         | B-9                                    | loads, OL4.13-6                          |
| problem, 464                            | multiprocessors, 534                   | behavioral specification, A-609,         |
| Tomasulo's algorithm, OL4.16-3          | Units                                  | OL4.13-2-4.13-4                          |
| Touchscreen, 19                         | commit, 350, 355                       | behavioral specification of multicycle   |
| Tournament branch predictors, 334       | control, 259, 271-273, C-4-8, C-10,    | MIPS design, OL4.13-12-4.13-13           |
| Tracks, 395-396                         | C-12-13                                | behavioral specification with            |
| Transfer time, 397                      | defined, 227                           | simulation, OL4.13-2                     |
| Transistors, 25                         | floating point, 227                    | behavioral specification with stall      |
| Translation Table Base Register (TTBR), | hazard detection, 324, 327–328         | detection, OL4.13-6                      |
| 449                                     | for load/store implementation, 267     | behavioral specification with synthesis, |
| Translation-lookaside buffer (TLB),     | special function (SFUs), B-35, B-43,   | OL4.13-11-4.13-16                        |
| 452-454, D-26-27, OL5.17-6. See         | B-50                                   | blocking assignment, A-612               |
| also TLB misses                         | UNIVAC I, OL1.12-5                     | branch hazard logic implementation,      |
| associativities, 454                    | UNIX, OL2.22-8, OL5.17-9-5.17-12       | OL4.13-8-4.13-10                         |
| illustrated, 453                        | AT&T, OL5.17-10                        | combinational logic, A-611-614           |
| integration, 457                        | Berkeley version (BSD), OL5.17-10      | datatypes, A-609–610                     |
| Intrinsity FastMATH, 454–457            | genius, OL5.17-12                      | defined, A-608                           |
| typical values, 454                     | history, OL5.17-9–5.17-12              | forwarding implementation, OL4.13-4      |
| Transmit driver and NIC hardware        | Unlock synchronization, 125            | MIPS ALU definition in, A-623–626        |
| time versus receive driver and NIC      | Unscaled signed immediate offset, 166  | modules, A-611                           |
| hardware time, OL6.9-8                  | Unsigned numbers, 75–82                | multicycle MIPS datapath, OL4.13-14      |

| Verilog (Continued)                              | Virtualizable hardware, 440              | handling, 407-409                     |
|--------------------------------------------------|------------------------------------------|---------------------------------------|
| nonblocking assignment, A-612                    | Virtually addressed caches, 458          | memory hierarchy handling of,         |
| operators, A-610                                 | Visual computing, B-3                    | 469-470                               |
| program structure, A-611                         | Volatile memory, 22                      | schemes, 408                          |
| reg, A-609–610                                   |                                          | virtual memory, 451                   |
| sensitivity list, A-612                          | W                                        | write-back cache, 408, 409            |
| sequential logic                                 |                                          | write-through cache, 408, 409         |
| specification, A-644–646                         | Wafers, 26                               | Write-stall cycles, 414               |
| structural specification, A-609                  | defects, 26                              | Write-through caches. See also Caches |
| wire, A-609-610                                  | dies, 26-27                              | advantages, 469                       |
| Vertical microcode, C-32                         | yield, 27                                | defined, 407, 469                     |
| Very large-scale integrated (VLSI)               | Warehouse Scale Computers (WSCs), 7,     | tag mismatch, 408                     |
| circuits, 25                                     | 545–550, 574                             |                                       |
| Very Long Instruction Word (VLIW)                | Warps, 544, B-27                         | X                                     |
| defined, 345                                     | Weak scaling, 521                        |                                       |
| first generation computers, OL4.16-5             | Wear levelling, 395                      | x86, 154–162                          |
| processors, 345                                  | While loops, 95                          |                                       |
| VHDL, A-608–609                                  | Whirlwind, OL5.17-2                      | Advanced Vector Extensions in, 232    |
|                                                  | Wide area networks (WANs), 24. See also  | brief history, OL2.22-6               |
| Video graphics array (VGA) controllers, B-3-4    | Networks                                 | conclusion, 162                       |
| Virtual addresses                                | Wide immediate operands, 115–117         | data addressing modes, 157–158        |
|                                                  | Words                                    | evolution, 154–157                    |
| causing page faults, 462                         |                                          | first address specifier encoding, 162 |
| defined, 442                                     | accessing, 69                            | instruction encoding, 161–162         |
| mapping from, 442–443                            | defined, 67                              | instruction formats, 161              |
| size, 444                                        | double, 158                              | instruction set growth, 170           |
| Virtual machine monitors (VMMs)                  | load, 69, 71                             | instruction types, 160                |
| defined, 438                                     | quad, 158                                | integer operations, 158–160           |
| implementing, 494                                | store, 71                                | registers, 157                        |
| laissez-faire attitude, 494                      | Working set, 464                         | SIMD in, 522                          |
| page tables, 463                                 | World Wide Web, 4                        | Streaming SIMD Extensions in,         |
| in performance improvement, 441                  | Worst-case delay, 283                    | 232–233                               |
| requirements, 440                                | Write buffers                            | typical instructions/functions, 161   |
| Virtual machines (VMs), 438–441                  | defined, 408                             | typical operations, 161               |
| benefits, 438                                    | stalls, 413                              | Xerox Alto computer, OL1.12-8         |
| illusion, 463                                    | write-back cache, 409                    | XMM, 232                              |
| instruction set architecture support,            | Write invalidate protocols, 479          |                                       |
| 441                                              | Write serialization, 479                 | Y                                     |
| performance improvement, 441                     | Write-back caches. See also Caches       |                                       |
| for protection improvement, 438                  | advantages, 469                          |                                       |
| Virtual memory, 441–465. See also Pages          | cache coherency protocol, OL5.12-5       | Yahoo! Cloud Serving Benchmark        |
| address translation, 443, 452-454                | complexity, 409                          | (YCSB), 556                           |
| integration, 457–459                             | defined, 408, 469                        | Yield, 27                             |
| for large virtual addresses, 450-451             | stalls, 413                              | YMM, 232                              |
| mechanism, 464                                   | write buffers, 409                       |                                       |
| motivations, 427-442                             | Write-back stage                         | Z                                     |
| page faults, 442, 448                            | control line, 313                        |                                       |
| protection implementation, 459-460               | load instruction, 303                    | Zettabyte, 6                          |
| segmentation, 445                                | store instruction, 305                   |                                       |
| summary, 463                                     | Writes                                   |                                       |
| virtualization of, 463                           | complications, 408                       |                                       |

expense, 464

writes, 452