**CS 4290: Advanced Computer Organization**

**Homework 3 - 100 points**

**Due : 07/21/2020, 11:55 pm, EST (via Canvas)**

**Important policies:**

1. You may submit this assignment as a group. Please ensure to have the name of every member of the group included at the top of the submission. It is highly encouraged you first attempt the homework individually, and then compare answers amongst each other in the group to catch possible mistakes.
2. Make and write suitable assumptions where necessary highlighting them. We also highly encourage that you show your work for your solutions, as we may use that as a guideline to award partial credit.
3. Please make sure to correctly follow all submission guidelines, whether it comes to answer formats or to submitted files. Failure to do so may result in a loss of points.
4. Unfortunately experience has shown that there is a very high chance that there are errors in this Homework. The online version will be updated as errors are discovered. It is your responsibility to check the website often and download new versions of this project description as they become available.

|  |  |
| --- | --- |
| **Question 1 - Multicore** | **15 points** |

1. List out the categories of Flynn’s Taxonomy and give an example architecture for each category.
   1. SISD: Single I, Single D Stream, traditional uniprocessor
      1. Most conventional computers have this architecture.
   2. SIMD: Single I, Multiple D Stream, can execute the same instructions but works on different data streams
      1. Cray’s vector processing machines
   3. MISD: Multiple I, Single D Stream, systems operate on the same data stream and must agree on the result
      1. Space shuttle flight control computer
   4. MIMD: Multiple I, Multiple D Streams, multiple instructions performed on multiple data sets
      1. IBM’s Symmetric Multi-Processing System
2. Compare and contrast the two types of memory models a multiprocessor system could implement.
   1. Message-Passing:
      1. Pros:
         1. A processor can directly address local memory only, meaning there is all communication must be explicit
         2. Each computer will have its own processor and memory
         3. Simpler and cheaper hardware to make
         4. Explicit communication can be used to directly tell what memory operations are the most costly
      2. Cons:
         1. Explicit communication is difficult to program
         2. Optimization must be done by hand
   2. Distributed Shared Memory:
      1. Pros:
         1. Communication is automatically
         2. Programs are easier to write and optimize
         3. No need to manually distribute data
      2. Cons:
         1. More hardware support necessary
         2. Programs are easy to write correctly, but may not be efficient
3. What is the difference between process and threads?
   1. Process:
      1. A program running on a machine
   2. Thread:
      1. A context within a program
      2. Shares a virtual address space with other threads
      3. Used for parallelism in programs

|  |  |
| --- | --- |
| **Question 2 - SMT** | **25 points** |

1. Detail the pros and cons of Simultaneous Multi-Threading?
   1. Pros:
      1. Instructions from different threads allows for less stall time, when one thread is stalled, another goes.
      2. It fills idle “issue slots” with work from other threads and throughput improves
   2. Cons:
      1. More hardware
         1. For N-way (N threads) we need N sets of registers, N RATs, and N virtual memory spaces
      2. More complex logic, need to maintain interrupts, exceptions, faults on a per-thread basis
      3. Can cause performance degradation by doing multiple tasks at the same time
2. Describe the changes that need to be made to add SMT support to an Out-of-Order superscalar architecture. Make sure to specifically address what units need to be replicated, as well as changes to existing units.
   1. For an N-way SMT
      1. N registers
      2. N RATs
      3. N virtual memory spaces
   2. We do not need to replicate the entire
3. Describe the cases where one-thread may stall the execution of another thread when performing SMT execution.
   1. One thread may stall the execution of another thread when both threads are attempting to access/manipulate the same portion of memory. Both threads cannot write to the same portion of memory at the same time otherwise data will be lost.

|  |  |
| --- | --- |
| **Question 3 - Cache Coherence** | **30 points** |

1. Given a machine with the following details:

* 64-bit machine with a byte-addressable 64GB physical memory space that uses 4KB pages
* An 32KB 4-way set-associative write-back, write-allocate cache
  + Cache block is 64B
  + Cache uses true LRU replacement policy
  + Cache is virtually indexed and virtually tagged
  + Cache implements MOSI protocol.

How big (in bits) is the tag store?

* Tag Store = Number of Lines \* (Tag Size + LRU Size + bits for states)
* Tag Store = (512) \* (128 + 2 + 4) = 68608 bits

1. Given a machine with the following details:

* 64-bit machine with a byte-addressable 64GB physical memory space that uses 4KB pages
* An 32KB 4-way set-associative write-back, write-allocate cache
  + Cache block is 64B
  + Cache uses true LRU replacement policy
  + Cache is virtually indexed and virtually tagged
  + Cache implements MOESI protocol.

How big (in bits) is the tag store?

* Tag Store = Number of Lines \* (Tag Size + LRU Size + bits for states)
* Tag Store = (512) \* (128 + 2 + 5) = 69120 bits

1. Given a two processor system (P0 and P1), with each processor paired with 2-entry fully associative private cache. The caches are kept coherent by using the MOSI protocol with bus intervention. There are 2 possible addresses in our program: A and B.

If the following stream of accesses is seen:

P0 Read A, P1 Read A, P0 Read A, P1 Write A, P0 Read A, P0 Write A

Then, assuming that the caches are initially empty, what is the miss rate of P0?

* 67% miss rate

1. Given the same details from Part C, say that the hit time of the local cache is 4 cycles, accessing data from a neighbor cache is 10 cycles, and accessing data from memory is 100 cycles, then how long will it take for P0 to complete its memory accesses?
2. Given a four processor system (P0, P1, P2, and P3), where each processor has a 4-entry fully associative private cache that uses the MESI protocol. Initially, P0’S private cache contains the following state:

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
|  | **A** | **B** | **C** | **D** |
| **State** | E | S | I | M |

If the following stream of requests is seen, how will the cache state change after each request?

P1 GetS A, P2 GetS A, P2 GetM B, P3 GetS D, P1 GetM A, P3 GetM D

1. What are mechanisms to reduce the metadata overhead for a directory-based coherence system? Make sure to mention what protocol changes may be need to be made.

|  |  |
| --- | --- |
| **Question 4 - Consistency** | **15 points** |

1. What is the difference between coherence and consistency?
   1. Coherence is defined as insurance that writes to a particular location will ALWAYS be seen in order.
   2. Consistency insures that writes to different locations will be seen in the order that makes the most sense, allowing for reordering.
2. Given Processor A that implements a weak consistency model, how would we modify a program that requires sequential consistency to run on Processor A?
   1. Sequential consistency requires all accesses for each processor to be kept in order. If we implemented a weak consistency model, we could keep synchronization accesses the same, but all writes in the system must fully complete before accesses occur and no data accesses are allowed AT ALL until all previous synchronization accesses have been completed.
3. Given Processor B that implements a release consistency model, how would we modify a program that requires sequential consistency to run on Processor B?
   1. We would not need to modify a program that requires sequential consistency because a release consistency model upholds sequentially consistent acquire and release fences.

|  |  |
| --- | --- |
| **Question 5 - Synchronization** | **10 points** |

1. To implement synchronization operations such as mutex locks or barriers, what considerations should be made in terms of cache coherence to gain the best performance.
   1. Must consider whether the thread is modifying the data or just reading the data. If both threads are simply trying to read the data from a shared variable, it can be allowed because they are not modifying that memory location.
   2. Casualty of writes must also be taken into consideration. Since all writes will be seen in the same order by all processors, a request queue for the memory must be made, otherwise a new thread could essentially “cut the line” and grab the memory before a process that has been waiting longer, essentially starving it.
2. If test-and-set is sufficient to create a mutex lock, why is test-and-test-and-set the preferred mechanism?
   1. It is the most simple mechanism, requires less programming to implement, and will return whether the mutex is busy or not in the fastest time possible.

**Deliverables:**

* **hw3.pdf**
  + A pdf file containing answers to each homework questions. If a sample answer format for a question is given, please make sure to follow it.