**CSCE 692 – Lab 3**

**Microarchitecture Experimentation**

Due: 28 Feb 2018 @1100

Assigned Benchmark: **CC1**

Assigned Partners: **\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_**

**Project Objective:** The objective of this project is to give the student experience in (1) the simulation and analysis of microprocessor architectures at a detailed level, and (2) making design decisions and tradeoffs based on performance *and* cost.

**Project Overview**: This project will use the SimpleScalar simulation tool to evaluate one of the benchmarks from Lab 2 with a focus on performance/cost (cost in area). You will work in teams as assigned. You will have to run **many** simulations. Do not leave it until the last minute and expect to be able to finish.

**Report Format:** The formal technical report (one per team) will be delivered in hard copy (without scripts or appendices). Separately, please email a soft copy of the report, together with any code, scripts, or appendices. The technical report will follow the 5-Chapter format of a thesis, albeit much shorter. For this report, consider a 5-Section (with preceding abstract) format as follows:

1. **Abstract** – (3-5 sentences is best)
   1. Highlight intent and findings. Give a reader a reason to read the introduction.
2. **Introduction** – (1 page max, ½ is better)
   1. Present and motivate the problem, bring the reader into the problem and make it interesting; highlight relevant issues and expectations, foreshadow conclusions.
3. **Background** – (1 page max, ½ is better)
   1. What related work is relevant to this issue? What additional information will a reader need in order to understand the issues of this problem?
4. **Methodology** – (10 page max)
   1. Explicitly state how you intend to solve the problem. What assumptions are you making? Why is your approach valid? Explain the setup and execution of any experiments. What does it cost you to implement the experiments and arrive at a solution? How will you analyze your solution?
5. **Analysis** – (10 page max)
   1. Present data from your research, including any experiments, and walk the reader through your analysis, justifying the steps. Include any necessary data, figures, tables etc.
6. **Conclusion** – (1 page max, ½ is better)
   1. Present and support your conclusions. Be as brief as possible (but no briefer ☺).
   2. Include an individual discussion of lessons learned, i.e., each team member should independently produce their own list of lessons learned… please make it clear which lessons come from whom.

**Tips:** When running SimpleScalar, it is always a good idea to check the output to make sure that

(a) there were no simulator errors in the run and

(b) the configuration used matches the configuration you intended to use.
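Both checks can be automated. The following is a minimal sketch, assuming the simulator output has been saved as text and that options and statistics appear on lines of the form `<name> <value> # <description>`. The helper names `parse_output` and `check_run`, the `fatal`/`panic` prefixes, and the sample excerpt are illustrative assumptions, not real simulator output; adjust them to what your runs actually print.

```python
import re

# Matches lines of the form "<name> <value> # <description>".
# Options that take several values on one line are not handled by this sketch.
LINE = re.compile(r"^(\S+)\s+(\S+)\s+#")


def parse_output(text):
    """Map each option or statistic name to its reported value (as a string)."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    return values


def check_run(text, intended):
    """Return a list of problems found in one saved simulator output."""
    # Assumption: error lines begin with "fatal" or "panic".
    problems = [line for line in text.splitlines()
                if line.startswith(("fatal", "panic"))]
    reported = parse_output(text)
    for option, value in intended.items():
        if reported.get(option) != str(value):
            problems.append(f"{option}: wanted {value}, got {reported.get(option)}")
    return problems


# Made-up excerpt for illustration only; not output from a real run.
sample = """\
-issue:width                 1 # instruction issue B/W (insts/cycle)
-res:ialu                    1 # total number of integer ALU's available
-res:fpalu                   4 # total number of floating point ALU's available
sim_CPI                 1.2345 # cycles per instruction
"""
print(check_run(sample, {"-issue:width": 1, "-res:ialu": 1, "-res:fpalu": 1}))
```

An empty list means the run reported no errors and every intended option matched; anything else should be investigated before the CPI from that run is used.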

**A.** An architect must make trade-offs to design a processor with the best performance for a given cost. For this problem, we will use chip area as the cost[[1]](#footnote-1). Table 1 lists the costs for the different types of resources. Using this table, you can calculate the area cost of any microprocessor in the space we are considering. For example, an in-order processor with a machine width of 1, 1 integer ALU, 1 integer multiplier, 1 memory port, 1 FP ALU, 1 FP multiplier, and a not-taken branch predictor would require 1350 units² (250 + 100 + 200 + 300 + 150 + 250 + 50 + 50).

**Table 1: Resource Cost**

| **Resource** | **Area (units²)** | **Possible Values** |
| --- | --- | --- |
| Machine width[[2]](#footnote-2) (*w*) | 250 *w* | 1, 2, 4 |
| Integer ALU (*ia*) | 100 *ia* | 1-4 |
| Integer multiplier (*im*) | 200 *im* | 1-4 |
| Memory port (*m*) | 300 *m* | 1-4 |
| FP ALU (*fa*) | 150 *fa* | 1-4 |
| FP multiplier (*fm*) | 250 *fm* | 1-4 |
| Branch predictor | 50 | Taken, Not-taken |
| 2-level predictor | 500 | --- |
| In-order issue | 50 | --- |
| Out-of-order issue | 170% of area of a similarly equipped in-order processor | --- |
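The area calculation from Table 1 can be written as a small function. This is a sketch only: the function name, parameter names, and predictor labels are illustrative, and it assumes the 170% out-of-order factor applies to the full in-order total, including the 50 units² for in-order issue. Confirm that reading before relying on it.

```python
# Area model transcribed from Table 1. Names are illustrative and are not
# part of SimpleScalar.
PREDICTOR_AREA = {"taken": 50, "nottaken": 50, "2lev": 500}
IN_ORDER_ISSUE_AREA = 50


def area(width, ialu, imult, memport, fpalu, fpmult, predictor, out_of_order):
    """Return the area in units^2 of one configuration under Table 1."""
    in_order_area = (
        250 * width
        + 100 * ialu
        + 200 * imult
        + 300 * memport
        + 150 * fpalu
        + 250 * fpmult
        + PREDICTOR_AREA[predictor]
        + IN_ORDER_ISSUE_AREA
    )
    # Assumption: out-of-order costs 170% of the complete in-order total.
    return in_order_area * 170 / 100 if out_of_order else in_order_area


baseline = area(1, 1, 1, 1, 1, 1, "nottaken", out_of_order=False)
print(baseline)  # 1350, matching the worked example in the text
```

Under the stated assumption, the same minimal resources with out-of-order issue would cost 2295 units², which still fits the smallest budget.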

**Find the three processor configurations[[3]](#footnote-3)[[4]](#footnote-4) for your assigned benchmark that yield the best performance with area less than or equal to:**

* 2600 units²
* 4000 units²
* 5000 units²

You should NOT need to run hundreds of simulations (e.g., if the ALU is underloaded, adding another will not increase performance much). Plan carefully, and use simulation results to further refine the search. How you did this should be clearly documented in the methodology section of the report.
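To see why planning matters, it can help to count how many configurations fit each budget before simulating anything. The sketch below enumerates the Table 1 space and filters by area; the names are illustrative, and it assumes the 170% out-of-order factor applies to the full in-order total. It is not a search method in itself, only a way to size the space that your simulation results then have to narrow.

```python
from itertools import product

PREDICTOR_AREA = {"taken": 50, "nottaken": 50, "2lev": 500}


def area(width, ialu, imult, memport, fpalu, fpmult, predictor, out_of_order):
    """Area in units^2 under Table 1 (out-of-order = 170% of in-order total)."""
    in_order = (250 * width + 100 * ialu + 200 * imult + 300 * memport
                + 150 * fpalu + 250 * fpmult + PREDICTOR_AREA[predictor] + 50)
    return in_order * 170 / 100 if out_of_order else in_order


def candidates(budget):
    """Yield every Table 1 configuration whose area fits within the budget."""
    units = range(1, 5)  # 1-4 of each functional unit
    for cfg in product((1, 2, 4), units, units, units, units, units,
                       PREDICTOR_AREA, (False, True)):
        if area(*cfg) <= budget:
            yield cfg


for budget in (2600, 4000, 5000):
    print(budget, sum(1 for _ in candidates(budget)))
```

The counts printed are upper bounds on the simulations a brute-force search would need, which is the situation the guidance above is meant to avoid.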

**Accurately describe your methodology for optimal configuration discovery**. Present the performance (CPI), the area, and the performance per unit area for your final configurations. You will be graded both on your methodology and on the final solutions you come up with. **Discuss the following**:

1. How does the performance per unit area compare among the three configurations you found? Do you think it is worth it to use the larger configurations?
2. Include a table of the configurations, area, and CPI for the simulations you ran while searching for the best solutions.

**B.** The configuration of the cache also plays a role in performance. For this question, use the processor configuration you found (in Part A of this lab) to be the best processor with an area less than or equal to 2600 units² for your assigned benchmark.

For this question you will vary the L1 data cache configuration for this processor. In the configuration file, the block size is given in bytes. Table 2 shows the ranges of parameters you should explore.

**Table 2: Cache Parameters**

| **Parameter** | **Possible Values** |
| --- | --- |
| Cache Size | 16KB, 64KB |
| Block Size (bytes) | 8, 16, 32 |
| Associativity | 1, 2, 4 |
| Replacement Policy | LRU |

Remember that the size of the cache is the product of the number of sets, the block size, and the associativity, so once you have picked the block size, associativity, and cache size, the number of sets is uniquely determined (i.e., SimpleScalar does not let you specify the cache size directly).
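The relationship above can be turned into a helper that derives the set count for each Table 2 combination. This sketch assumes the cache is described to the simulator as `<name>:<nsets>:<bsize>:<assoc>:<repl>` with `l` selecting LRU; verify the format against the help output of your SimpleScalar build. The function name `dl1_option` is illustrative.

```python
def dl1_option(size_kb, block_bytes, assoc, repl="l"):
    """Return (number of sets, config string) for one Table 2 combination."""
    size_bytes = size_kb * 1024
    # sets = cache size / (block size * associativity)
    nsets, remainder = divmod(size_bytes, block_bytes * assoc)
    if remainder:
        raise ValueError("cache size is not a multiple of block size x associativity")
    # Assumed format: <name>:<nsets>:<bsize>:<assoc>:<repl>
    return nsets, f"dl1:{nsets}:{block_bytes}:{assoc}:{repl}"


# All 18 combinations from Table 2.
for size_kb in (16, 64):
    for block in (8, 16, 32):
        for assoc in (1, 2, 4):
            print(size_kb, block, assoc, *dl1_option(size_kb, block, assoc))
```

For example, a 16KB cache with 32-byte blocks and 4-way associativity works out to 128 sets.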

Also, larger memories are slower than smaller memories. Assume that a 64KB cache takes 2 cycles to access and that a 16KB cache takes a single cycle to access (in reality the access time is also dependent on the associativity). Make sure to address the following items in your report:

1. What is the L1 data cache miss rate and the CPI of your benchmark for this processor configuration for each of the possible cache configurations? Report the cache configuration, miss rate, and the CPI for each configuration in a table.
2. Under what circumstances would a 64KB cache be more valuable than a 16KB cache for this application?
3. Were you able to outperform the larger processors you found in Part A simply by varying the cache configuration? Do you think it would be possible to do so with more freedom to vary the cache configuration? Why or why not?
4. Do you think the cache configuration would matter more or less for applications with larger data set sizes than this one?

1. The area values are gross approximations to the area actually consumed by these resources in a real processor (e.g., the cost of increasing the issue width of an out-of-order processor is a function of *w*², not *w* as we are using here). [↑](#footnote-ref-1)
2. “Machine width” implies that fetch, decode, issue, and commit are all equal. [↑](#footnote-ref-2)
3. If your benchmark does not benefit from additional resources, you must explicitly set these arguments to 1; otherwise, the default values (e.g., 4 for -res:fpalu) will be used, yielding inappropriate CPI measurements. [↑](#footnote-ref-3)
4. For consistency, please use a cache replacement policy of Least Recently Used (LRU). [↑](#footnote-ref-4)