It really is beautiful when the abstract logic connects directly to the physical
silicon. You are starting to see the "Matrix" of how software actually runs!

These ... how we build instructions to how fast the computer can execute them. 


---
1. THE MEANING OF "PERFORMANCE" (P3)
   When we talk about performance, we aren't just talking about abstract speed.
   It depends on your goal:
   - PURCHASING PERSPECTIVE: You want the best performance for your budget.
   - DESIGN PERSPECTIVE: As an architect, you want to build a machine that 
     outperforms competitors while keeping manufacturing costs low.

   To do either of these, you need a standard, mathematical way to calculate and
   compare performance so you aren't just guessing.


---
2. THE GOLDEN EQUATION (P4)     
   This is the most important formula in computer architecture. Do not just look
   at a computer's "GHz" (clock speed) to know if it is fast! The true metric is
   EXECUTION TIME.

   $\text{Execution Time} = \text{Instruction Count} \times \text{CPI} \times \text{Cycle Time}$

   - INSTRUCTION COUNT: How many instructions (like the 32-bit R-type and J-type
     instructions we just looked at) does the CPU actually execute to run your
     program? This is the dynamic count, so an instruction inside a loop that
     runs 1,000 times counts 1,000 times.
   - CPI (CYCLES PER INSTRUCTION): On average, how many "ticks" of the 
     computer's clock does it take to finish one instruction? (Addition might
     take 1 tick; pulling data from RAM might take 10 ticks.)
   - CYCLE TIME: The physical length of one clock tick in seconds (e.g., 250 ps).
     It is the inverse of the clock rate, so a 250 ps cycle means a 4 GHz clock.
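
   The following minimal Python sketch shows how the units cancel out in the
   Golden Equation. The numbers are hypothetical: a program of one billion
   instructions, an average CPI of 2.0 and a 4 GHz clock. The helper names are
   also made up for illustration.

```python
def execution_time(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """Golden Equation: seconds = instructions x (cycles / instruction) x (seconds / cycle)."""
    return instruction_count * cpi * cycle_time_s


def cycle_time_from_clock(clock_rate_hz: float) -> float:
    """Cycle time is the inverse of clock rate, e.g. 4 GHz -> 250 ps."""
    return 1.0 / clock_rate_hz


if __name__ == "__main__":
    # Hypothetical program: 1e9 instructions, average CPI of 2.0, 4 GHz clock.
    t = execution_time(1_000_000_000, 2.0, cycle_time_from_clock(4e9))
    print(f"{t:.3f} s")
```

   With these numbers the program takes 0.5 seconds. The three factors are
   multiplied, so halving any one of them halves the execution time. That is
   why GHz alone cannot tell you how fast a machine is.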

   EXAMPLE: ...


---    
3. COMPARING TWO PROCESSORS (P5 & P6)
   These slides present a math problem: How do you compare two processors (M1
   and M2) that have different clock speeds and different CPIs?

   WHY IT MATTERS: Your professor stresses that you should compare the ratio of
   their execution times, NOT just compare how many instructions they can do per
   second.
      - EXAMPLE: If M1 finishes a task in 10 seconds, and M2 finishes the 
        identical task in 5 seconds, M2 is twice as fast (10 s / 5 s = 2). You
        just plug each machine's `Count`, `CPI`, and `Clock Speed` (cycle time is
        1 / clock rate) into the Golden Equation, then divide one execution time
        by the other.
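
   Here is a minimal sketch of that comparison. It assumes both machines
   execute the same instruction count, i.e. the same ISA and compiler. The
   clock rates and CPIs are hypothetical. They were picked to show that the
   machine with fewer GHz can still win.

```python
def execution_time(instruction_count: int, cpi: float, clock_rate_hz: float) -> float:
    # Cycle time = 1 / clock rate, so dividing by the clock rate is the same
    # as multiplying by the cycle time in the Golden Equation.
    return instruction_count * cpi / clock_rate_hz


def relative_performance(time_a_s: float, time_b_s: float) -> float:
    """How many times faster machine B is than machine A (ratio of execution times)."""
    return time_a_s / time_b_s


if __name__ == "__main__":
    ic = 1_000_000_000                     # same program on both machines
    t_m1 = execution_time(ic, 2.0, 4e9)    # hypothetical M1: 4 GHz, CPI 2.0
    t_m2 = execution_time(ic, 0.8, 2e9)    # hypothetical M2: 2 GHz, CPI 0.8
    print(f"M1: {t_m1:.2f} s, M2: {t_m2:.2f} s")
    print(f"M2 is {relative_performance(t_m1, t_m2):.2f}x faster than M1")
```

   With these numbers M1 takes 0.5 s and M2 takes 0.4 s. M2 is therefore 1.25x
   faster, even though its clock runs at half the rate.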


---
4. WHAT AFFECTS THE GOLDEN EQUATION? (P7)
   This table is brilliant because it connects software engineering directly to
   hardware performance.

   - ALGORITHM: Affects Instruction Count and CPI.
      - Why Count? An inefficient sorting algorithm will require the CPU to 
        execute millions of extra instructions.
      - Why CPI? If your algorithm relies heavily on division (a slow operation
        taking many cycles), your average CPI goes up. If you rewrite it to use
        bit-shifts where possible (a fast operation; dividing an unsigned integer
        by a power of two is just a right shift), your CPI drops.
   - LANGUAGE: Affects Instruction Count and CPI.
      - Why? A low-level language like C compiles directly to machine code,
        usually with few instructions per line of source. A high-level language
        like Java runs on a virtual machine and makes more indirect calls, which
        typically means more instructions executed and a higher CPI.
   - COMPILER: Affects Instruction Count and CPI.
      - Why? A "smart" compiler might realise your code has a useless loop and
        delete it entirely before it even becomes assembly code, slashing the
        Instruction Count.
   - INSTRUCTION SET (ISA): Affects Count, CPI and Clock Rate.
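      - Why? The ISA decides how many instructions a task needs (Count) and how
        much work each one does (CPI). If the blueprint (like RISC-V) is simple,
        you can also clock the chip faster.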
   - ORGANISATION & TECHNOLOGY: Affects Clock Rate.
      - Why? This is the physical hardware side: better silicon manufacturing
        (smaller, faster transistors) lets the chip tick faster.
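
   The algorithm's effect on CPI comes from its instruction mix. Average CPI is
   the weighted average of each instruction class's cycle count. The minimal
   sketch below uses hypothetical numbers, not measurements of any real
   processor. It assumes 1 cycle for ALU ops and shifts, 5 for loads and 20 for
   division, and it compares two made-up mixes.

```python
def average_cpi(mix: dict[str, float], cycles: dict[str, int]) -> float:
    """Weighted average CPI: sum over classes of (fraction of instructions x cycles)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("instruction-mix fractions must sum to 1")
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


# Hypothetical cycle cost of each instruction class.
CYCLES = {"alu": 1, "shift": 1, "load": 5, "div": 20}

# Hypothetical mixes: the same loop before and after replacing divisions
# by a power of two with right shifts (only valid for powers of two).
divide_heavy = {"alu": 0.6, "load": 0.2, "div": 0.2}
shift_based = {"alu": 0.6, "load": 0.2, "shift": 0.2}

if __name__ == "__main__":
    print(f"divide-heavy CPI: {average_cpi(divide_heavy, CYCLES):.2f}")
    print(f"shift-based CPI:  {average_cpi(shift_based, CYCLES):.2f}")
```

   Swapping one fifth of the instructions from a 20-cycle divide to a 1-cycle
   shift drops the average CPI from 5.6 to 1.8 in this toy mix. The instruction
   count is unchanged.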

---
5. HISTORICAL CHALLENGES (P8)
   This graph tells the story of the last 40 years of computing.
      - The CISC Era: Computers had massive, complex instructions. Performance
        growth was steady but slow (22% per year).
      - The RISC Era: Architects realised that by forcing instructions to be
        dead-simple and uniform (like the 32-bit R-types we discussed), they
        could drastically lower the CPI. Performance exploded (52% per year).
      - The Multicore Era: Eventually, we hit the power wall. Transistors kept
        shrinking, but their voltage could no longer shrink with them (the End
        of Dennard Scaling), so power density climbed. We couldn't safely push
        clock speeds higher without melting the chip, so we started putting
        multiple processor cores on a single chip (Multicore).
        

---
6. THE REAL-WORLD PROOF: RISC vs CISC (P9)
   The slide uses the Golden Equation to prove why RISC beat CISC. It compares
   an old SUN 68000 (CISC) to a newer SUN RISC (SPARC).
   - The Disadvantages of RISC: The RISC machine required more instructions to 
     do the same job (Instruction count ratio of 1.25), and its physical clock
     was actually slower (60ns vs 40ns cycle time).
   - WHY DID IT WIN? Look at the CPI. The CISC machine took 5.0 to 7.0 cycles
     per instruction. The RISC machine only took 1.3 to 1.7 cycles!
   - THE RESULT: Even though the RISC machine had a slower clock and had to
     execute more instructions, every instruction finished so quickly (low CPI)
     that it completed the entire program in roughly half the time.
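
   Plugging the slide's ratios into the Golden Equation reproduces this result.
   The sketch uses only the figures quoted above: an instruction-count ratio of
   1.25, CPI ranges of 1.3 to 1.7 versus 5.0 to 7.0, and cycle times of 60 ns
   versus 40 ns. Every ratio is taken as RISC over CISC, so a result below 1
   means the RISC machine is faster. Using the midpoints of the CPI ranges is a
   simplifying assumption.

```python
def time_ratio(ic_ratio: float, cpi_risc: float, cpi_cisc: float,
               cycle_risc_ns: float, cycle_cisc_ns: float) -> float:
    """Execution time of RISC divided by CISC, using the Golden Equation term by term."""
    return ic_ratio * (cpi_risc / cpi_cisc) * (cycle_risc_ns / cycle_cisc_ns)


if __name__ == "__main__":
    IC_RATIO = 1.25                   # RISC executes 1.25x as many instructions
    CYCLE_RISC, CYCLE_CISC = 60, 40   # ns per cycle
    # Midpoints of the quoted CPI ranges (a simplification).
    mid = time_ratio(IC_RATIO, 1.5, 6.0, CYCLE_RISC, CYCLE_CISC)
    # Best case for RISC (lowest RISC CPI, highest CISC CPI) and worst case.
    best = time_ratio(IC_RATIO, 1.3, 7.0, CYCLE_RISC, CYCLE_CISC)
    worst = time_ratio(IC_RATIO, 1.7, 5.0, CYCLE_RISC, CYCLE_CISC)
    print(f"RISC time / CISC time: {mid:.2f} (range {best:.2f} to {worst:.2f})")
```

   With the midpoints the ratio is about 0.47, roughly half the time. Across the
   quoted ranges it stays between about 0.35 and 0.64. The 4x CPI advantage
   outweighs both the 1.25x extra instructions and the 1.5x longer cycle.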

... Going from the physical constraints of an architecture to how we actually
measure its success is the perfect progression.


---
1. THE RISC SECRET & THE COMPROMISE (P10)
   This page formally spells out exactly why the RISC architecture (like the
   RISC-V you are studying) took over the computing world.
   - THE PRINCIPLE: "Make the common case fast." By making instructions 
     incredibly simple and uniform (a fixed 32-bit encoding in base RISC-V, each
     doing one small task), hardware engineers can optimise the physical silicon
     to run those basic tasks lightning-fast. This aggressively reduces the CPI
     (Cycles Per Instruction) and allows for a faster clock cycle time.
   - THE TRADE-OFF: There is no free lunch in CS. Because each instruction is so
     simple, you need more of them to accomplish a complex task.
   - EXAMPLE: If a CISC processor has a complex "Calculate Square Root" 
     instruction, it takes 1 instruction but maybe 50 clock cycles to finish. A
     RISC processor might require 15 basic math instructions to calculate that
     same square root, but each one only takes 1 clock cycle. Assuming both
     chips run at the same cycle time, RISC wins because 15 total cycles is
     still much faster than 50.
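
   The following sketch shows how a "complex" operation breaks down into simple
   ones. It computes an integer square root using only shifts, additions,
   subtractions and comparisons, the kind of operations a RISC core typically
   executes in one cycle each. It uses the classic digit-by-digit (binary)
   method, chosen purely for illustration. It is not what any particular
   compiler or math library emits. The 32-bit input limit is an assumption that
   mimics a 32-bit register.

```python
def isqrt_shift_add(n: int) -> int:
    """floor(sqrt(n)) for 0 <= n < 2**32, using only shift/add/subtract/compare."""
    if not 0 <= n < 2**32:
        raise ValueError("expected an unsigned 32-bit value")
    result = 0
    bit = 1 << 30                 # highest power of four below 2**32
    while bit > n:                # skip powers of four larger than n
        bit >>= 2
    while bit != 0:
        if n >= result + bit:
            n -= result + bit
            result = (result >> 1) + bit
        else:
            result >>= 1
        bit >>= 2
    return result


if __name__ == "__main__":
    for value in (0, 1, 15, 16, 1_000_000, 2**32 - 1):
        print(value, isqrt_shift_add(value))
```

   Even this short routine runs its main loop up to 16 times, with several
   simple operations per iteration. That is the "more instructions" side of the
   RISC trade-off.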


---
2. HOW TO KEEP GETTING FASTER (P11)
   Since we hit the physical limit of how fast we can make a single processor
   "tick" (the "End of Dennard Scaling" mentioned earlier), computer architects
   had to get creative to keep improving performance:

   - CACHES (Fast Local Store): Fetching data from main memory (RAM) is
     agonisingly slow. Caches act like a small, ultra-fast desk drawer right
     inside the CPU so it doesn't have to walk to the filing cabinet (RAM) every
     time. 
   - PIPELINING & SUPERSCALAR (Concurrent Execution): Instead of waiting for one
     instruction to finish completely before starting the next, the CPU operates
     like an assembly line, overlapping the fetch, decode, and execute steps of
     multiple instructions simultaneously.
   - DOMAIN-SPECIFIC HARDWARE: This is the TPU concept from L1! For specific
     tasks like AI or graphics, we use specialised hardware (like TPUs, GPUs, or
     FPGAs) instead of general-purpose CPUs.
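
   A toy model makes the "desk drawer" idea concrete. The following minimal
   sketch of a direct-mapped cache only counts hits and misses. The sizes (8
   lines of 16 bytes) and the access patterns are hypothetical. Real caches also
   add associativity, replacement policies and write handling.

```python
def simulate_direct_mapped(addresses, num_lines: int = 8, block_size: int = 16):
    """Return (hits, misses) for a sequence of byte addresses."""
    tags = [None] * num_lines            # which memory block each line holds
    hits = misses = 0
    for addr in addresses:
        block = addr // block_size       # memory block containing this byte
        index = block % num_lines        # the one line this block may occupy
        tag = block // num_lines         # identifies which block is in the line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1                  # fetch from slow memory, evict old block
            tags[index] = tag
    return hits, misses


if __name__ == "__main__":
    # Sequential 4-byte reads: each 16-byte block misses once, then hits 3 times.
    print(simulate_direct_mapped(range(0, 256, 4)))
    # Two addresses 128 bytes apart map to the same line and keep evicting each other.
    print(simulate_direct_mapped([0, 128] * 8))
```

   The sequential pattern gives 48 hits and 16 misses. The second pattern
   misses on every access, so each access pays the full trip to the filing
   cabinet.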


---
3. EVALUATING PERFORMANCE HONESTLY (P12 & P13)
   How do you actually prove your new CPU design is better? You cannot just rely
   on clock speed or "MIPS" (Millions of Instructions Per Second).
      - WHY MIPS IS FLAWED: If a RISC processor runs 10 million simple 
        instructions per second, and a CISC processor runs 5 million complex
        instructions per second, MIPS makes the RISC look twice as fast. But if
        the CISC's complex instructions were doing significantly more work, it
        might finish the actual program sooner!
      - THE SOLUTION: We use Benchmarks.
         - ACTUAL TARGET WORKLOAD: Testing the exact software the user runs.
           Extremely accurate, but not portable to other users.
         - FULL APPLICATION BENCHMARKS: A standardised suite of real-world 
           programs. This is highly representative and widely used.
         - SMALL KERNEL BENCHMARKS: Testing tiny, specific loops of code. Great
           for early hardware design, but rarely reflects true real-world
           performance.
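
   The MIPS pitfall can be checked with numbers. In the sketch below the rates
   match the example above: 10 million instructions per second for RISC and 5
   million for CISC. The instruction counts for the program (30 million versus
   10 million) are hypothetical. They were chosen to show the CISC machine
   finishing first despite its lower MIPS rating.

```python
def mips(instruction_count: int, execution_time_s: float) -> float:
    """Millions of instructions per second."""
    return instruction_count / (execution_time_s * 1e6)


def run_time(instruction_count: int, instructions_per_second: float) -> float:
    """Seconds needed to execute the given number of instructions."""
    return instruction_count / instructions_per_second


if __name__ == "__main__":
    risc_time = run_time(30_000_000, 10e6)   # hypothetical: 30M simple instructions
    cisc_time = run_time(10_000_000, 5e6)    # hypothetical: 10M complex instructions
    print(f"RISC: {mips(30_000_000, risc_time):.0f} MIPS, {risc_time:.1f} s")
    print(f"CISC: {mips(10_000_000, cisc_time):.0f} MIPS, {cisc_time:.1f} s")
```

   RISC reports twice the MIPS, but it takes 3 s while CISC takes 2 s. Only
   execution time tells you which machine finished the job first.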


---
4. THE SPEC BENCHMARK IN ACTION (P14 & P15)
   SPEC (Standard Performance Evaluation Corporation, originally the System
   Performance Evaluation Cooperative) is the gold standard for full-application
   benchmarking. The CPU2000 version runs 26 different intensive programs (like
   compiling C code, chess AI, or video compression) to stress-test the CPU.

   - THE TABLE ANALYSIS: Look at the `mcf` (Combinatorial optimisation) row for
     the Opteron X4 processor. Its CPI is a disastrous 10.00. The note points 
     out this is due to a "High cache miss rate": the CPU kept looking for data
     in its fast local cache, couldn't find it, and had to wait idly while the
     data was fetched from slow RAM.
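   - THE GEOMETRIC MEAN: To find the overall score across all these wildly 
     different tasks, they divide the reference machine's execution time by the
     measured execution time for each program to get a ratio. They then take the
     Geometric Mean of those ratios (the n-th root of their product). This keeps
     one outlier from dominating the score, and the comparison between two
     machines comes out the same whichever reference machine is used.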
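
   Here is a minimal sketch of a SPEC-style summary. Each benchmark's ratio is
   the reference time divided by the measured time, so higher is better. The
   overall score is the geometric mean of those ratios. All times are
   hypothetical. The last lines check that the ratio between two machines'
   scores does not depend on which reference machine is picked.

```python
import math


def spec_ratios(reference_times, measured_times):
    """Per-benchmark ratio: reference time / measured time (higher is better)."""
    return [ref / t for ref, t in zip(reference_times, measured_times)]


def geometric_mean(values):
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    reference = [100.0, 200.0, 400.0]   # hypothetical reference-machine times (s)
    machine_a = [10.0, 50.0, 40.0]      # hypothetical measured times (s)
    machine_b = [20.0, 20.0, 80.0]

    score_a = geometric_mean(spec_ratios(reference, machine_a))
    score_b = geometric_mean(spec_ratios(reference, machine_b))
    print(f"A: {score_a:.2f}, B: {score_b:.2f}, A/B = {score_a / score_b:.3f}")

    # Swap in a different (hypothetical) reference machine: A/B is unchanged.
    other_reference = [1.0, 1.0, 1.0]
    a2 = geometric_mean(spec_ratios(other_reference, machine_a))
    b2 = geometric_mean(spec_ratios(other_reference, machine_b))
    print(f"A/B with another reference = {a2 / b2:.3f}")
```

   The reference times cancel out when you divide one machine's geometric mean
   by the other's. So the ranking depends only on the measured times.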


---
5. A REAL-WORLD SHOWDOWN: Pentium III vs Pentium 4 (P16)
   This final slide is a perfect culmination of everything you've learned. It
   compares the SPEC benchmark scores of two processors across different clock
   speeds.

   - THE MYSTERY: ...
   - THE ANSWER: The Pentium 4 included a brand new set of instructions called 
     the Streaming SIMD Extensions 2 (SSE2).
   - THE CONNECTION TO THE GOLDEN EQUATION: By updating the ISA to include these
     new SSE2 instructions, the compiler could translate floating-point math
     differently. This altered both the Instruction Count and the CPI 
     specifically for floating-point tasks, drastically lowering the overall
     Execution Time for those programs.
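
   The instruction-count effect of SIMD can be sketched in a few lines. With
   SSE2, one 128-bit XMM register holds two 64-bit doubles, so one packed add
   can do the work of two scalar adds. The function below is a hypothetical
   model, not real SSE2 code. It adds two arrays in chunks of `lanes` elements
   and counts one "instruction" per chunk. It ignores the loads, stores and
   loop overhead that real code also executes.

```python
def packed_add(a, b, lanes: int = 2):
    """Element-wise a + b, processed `lanes` elements at a time.

    Returns (result, instructions), where each chunk counts as one
    hypothetical packed-add instruction (lanes=1 models scalar code).
    """
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    result, instructions = [], 0
    for i in range(0, len(a), lanes):
        result.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [0.5, 0.5, 0.5, 0.5, 0.5]
    print(packed_add(xs, ys, lanes=1))  # scalar: one add per element
    print(packed_add(xs, ys, lanes=2))  # 2-wide: about half the adds
```

   Five elements need five scalar adds but only three packed adds, because the
   last chunk is half full. This is the kind of Instruction Count reduction that
   the Golden Equation picks up.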

---

- ... a MULTIPLEXOR (often abbreviated as MUX) is a combinational logic circuit
  that selects one of many input signals and forwards it to a single output
  line. Think of it as a digitally controlled switch or "traffic controller"
  that decides which data source gets to use a shared resource at any given
  moment.

  KEY COMPONENTS
     A multiplexor consists of three main types of lines:
     - DATA INPUTS ($2^n$): The various signal sources that are waiting to be
       sent to the output.
     - SELECT LINES ($n$): Control signals used to specify which particular
       input should be connected to the output. The number of select lines
       ($n$) determines the maximum number of inputs ($2^n$).
     - OUTPUT (1): The single line where the selected input data is transmitted.
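
     The following behavioural Python sketch models a $2^n$-to-1 multiplexor.
     The $n$ select bits are read as a binary number that picks one of the
     $2^n$ data inputs. The function names and the most-significant-bit-first
     ordering of the select lines are assumptions for illustration. The 2-to-1
     case is also shown in its usual gate-level form,
     out = (NOT s AND a) OR (s AND b).

```python
def mux(data_inputs, select_bits):
    """Forward one of 2**n data inputs, chosen by n select bits (MSB first)."""
    n = len(select_bits)
    if len(data_inputs) != 2 ** n:
        raise ValueError("a MUX with n select lines needs exactly 2**n inputs")
    index = 0
    for bit in select_bits:          # interpret select lines as a binary number
        index = (index << 1) | (bit & 1)
    return data_inputs[index]


def mux2_gates(a: int, b: int, s: int) -> int:
    """2-to-1 MUX from AND/OR/NOT on single bits: s=0 -> a, s=1 -> b."""
    return ((1 - s) & a) | (s & b)


if __name__ == "__main__":
    # 4-to-1 MUX (n = 2 select lines): select bits 1, 0 pick input 2.
    print(mux(["in0", "in1", "in2", "in3"], [1, 0]))
    print(mux2_gates(a=0, b=1, s=1))
```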


  ROLES IN COMPUTER SYSTEMS
     Multiplexors are essential building blocks found throughout a processor's
     architecture:
     - ARITHMETIC LOGIC UNIT (ALU): MUXes are used to select which operands 
       (e.g., from registers or immediate values) are fed into the ALU for a
       calculation.
     - DATA ROUTING: They manage the flow of data across internal buses, 
       choosing which component (like memory or an I/O device) has access to 
       the bus.
     - PROGRAM COUNTER (PC) LOGIC: Used to select the next instruction address,
       deciding between the next sequential address and a jump or branch target
       address.
     - MEMORY ACCESS: MUXes help in addressing specific memory locations and
       picking the correct data out of caches.
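
     The Program Counter case is a 2-to-1 MUX in action. This minimal sketch
     assumes a RISC-V-style machine where sequential instructions are 4 bytes
     apart. `branch_taken` and `branch_target` are hypothetical names. They
     stand in for the control signal and the computed target address.

```python
def next_pc(pc: int, branch_taken: bool, branch_target: int) -> int:
    """2-to-1 MUX on the PC: the select line is the branch_taken control signal."""
    sequential = pc + 4              # input 0: next instruction (assumes 4-byte instructions)
    # input 1: the branch/jump target, chosen when the select signal is set
    return branch_target if branch_taken else sequential


if __name__ == "__main__":
    print(hex(next_pc(0x1000, branch_taken=False, branch_target=0x2000)))
    print(hex(next_pc(0x1000, branch_taken=True, branch_target=0x2000)))
```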

- In computer architecture, an operand is the data component of an instruction,
  representing the specific value, memory location, or register being 
  manipulated by an operator (opcode). It acts as the object of an operation
  (e.g., in `ADD A, B`, A and B are operands), defining what data is processed.

  KEY DETAILS ABOUT OPERANDS
     - ROLE: Operands provide the data needed for arithmetic, logical, or data
       transfer operations.
     - TYPES: They can be immediate values (constants), register identifiers
       (e.g., `EAX`), memory addresses (RAM locations), or labels.
     - INSTRUCTION FORMAT: An instruction typically consists of an opcode (what
       to do) and one or more operands (what to do it to).
     - ADDRESSING MODES: Addressing modes define how the operand is located, 
       such as directly within the instruction, in a register, or in memory.      
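
     The following toy operand resolver makes the addressing modes concrete.
     The mode names, register file and memory contents are hypothetical. It only
     shows how the same operand field can mean a constant, a register, or a
     memory location depending on the mode.

```python
from enum import Enum


class Mode(Enum):
    IMMEDIATE = "immediate"   # the value is in the instruction itself
    REGISTER = "register"     # the operand names a register
    MEMORY = "memory"         # the operand is a memory address


def resolve_operand(mode: Mode, field, registers: dict, memory: dict) -> int:
    """Return the data value an operand refers to under the given addressing mode."""
    if mode is Mode.IMMEDIATE:
        return field
    if mode is Mode.REGISTER:
        return registers[field]
    if mode is Mode.MEMORY:
        return memory[field]
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    regs = {"x5": 7}                 # hypothetical register contents
    mem = {0x100: 42}                # hypothetical memory contents
    # A generic two-operand ADD like `ADD A, B`: one operand from a register,
    # the other from memory (as a CISC-style ISA might allow).
    a = resolve_operand(Mode.REGISTER, "x5", regs, mem)
    b = resolve_operand(Mode.MEMORY, 0x100, regs, mem)
    print(a + b)
```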