|  |  |
| --- | --- |
| **Computer Architectures**  **02LSEOV** | Delivery date:  **By 2:00 AM on October 16, 2024** |
| **Laboratory**  **2** | The expected delivery of lab\_02.zip must include:   * **program\_1.s** * This file, filled in and possibly exported to PDF format. |

Please configure the WinMIPS64 simulator with the *Initial Configuration* provided below:

* *Integer ALU: 1 clock cycle*
* *Data memory: 1 clock cycle*
* Code address bus: 12 bits
* Data address bus: 12 bits
* FP arithmetic unit: pipelined, 4 clock cycles
* FP multiplier unit: pipelined, 6 clock cycles
* FP divider unit: not pipelined, 30 clock cycles

1. Write an assembly program (**program\_1.s**) for the *WinMIPS64* architecture described above that implements the following high-level code:

for (i = 31; i >= 0; i--){

v4[i] = v1[i]\*v1[i] - v2[i];

v5[i] = v4[i]/v3[i] - v2[i];

v6[i] = (v4[i]-v1[i])\*v5[i];

}

Assume that the vectors v1[], v2[], and v3[] have been previously allocated in memory and each contain 32 double-precision **floating-point values**; also assume that v3[] does not contain any 0 values. Additionally, the vectors v4[], v5[], and v6[] are empty vectors that are also allocated in memory.
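To check the values your assembly program leaves in memory, a short reference model can help. The sketch below is a plain Python transcription of the high-level loop; the function name `reference_results` and the sample input vectors are hypothetical and are not part of the assignment. Python floats are IEEE 754 double precision, so the results should match those of a program that performs the operations in the same order.

```python
def reference_results(v1, v2, v3):
    """Compute v4, v5, v6 exactly as the high-level loop does."""
    n = len(v1)
    v4, v5, v6 = [0.0] * n, [0.0] * n, [0.0] * n
    # Same iteration order as the assignment: i = n-1 down to 0.
    for i in range(n - 1, -1, -1):
        v4[i] = v1[i] * v1[i] - v2[i]
        v5[i] = v4[i] / v3[i] - v2[i]
        v6[i] = (v4[i] - v1[i]) * v5[i]
    return v4, v5, v6


# HYPOTHETICAL input data: use the values declared in your own program_1.s.
v1 = [float(i + 1) for i in range(32)]
v2 = [0.5] * 32
v3 = [2.0] * 32  # v3 must not contain zeros

v4, v5, v6 = reference_results(v1, v2, v3)
for i in range(3):
    print(f"i={i}: v4={v4[i]}, v5={v5[i]}, v6={v6[i]}")
```

Compare the printed values with the contents of the simulator's Data window after the program halts.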

**Calculate** the data memory footprint of your program:

|  |  |
| --- | --- |
| Data | Number of bytes |
| V1 |  |
| V2 |  |
| V3 |  |
| V4 |  |
| V5 |  |
| V6 |  |
| Total |  |
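The footprint follows from the element count and the element size. The sketch below assumes that each double-precision value occupies 8 bytes and that the simulator can address 2^12 bytes of data memory with the 12-bit data address bus; it counts only the six vectors, not any additional variables your program may declare.

```python
ELEMENTS_PER_VECTOR = 32      # from the assignment text
BYTES_PER_DOUBLE = 8          # a double-precision value is 64 bits
DATA_ADDRESS_BUS_BITS = 12    # from the Initial Configuration

vectors = ["v1", "v2", "v3", "v4", "v5", "v6"]
footprint = {name: ELEMENTS_PER_VECTOR * BYTES_PER_DOUBLE for name in vectors}
total_bytes = sum(footprint.values())

# Assumption: the addressable data memory is 2**bits bytes.
available_bytes = 2 ** DATA_ADDRESS_BUS_BITS

for name, size in footprint.items():
    print(f"{name}: {size} bytes")
print(f"Total: {total_bytes} bytes out of {available_bytes} addressable")
print("Fits in data memory:", total_bytes <= available_bytes)
```

Repeating the check with a narrower data address bus shows when the vectors would no longer fit.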

Are there any issues? If so, where and why? Do you need to change anything?

|  |
| --- |
| Your answer: |

ATTENTION: WinMIPS64 has a limitation regarding the maximum length of the string when declaring a vector. It is therefore recommended to split the elements of the vectors into multiple lines: this also increases readability.

Example: my\_fancy\_vector: .byte 8, 12, 2, 9

.byte 49, 77, 28

.byte ……
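Typing 32 values per vector by hand is error-prone, so the declarations can also be generated. The helper below is a minimal sketch: the name `format_vector`, the choice of four values per line, and the sample data are all hypothetical. It emits one labelled `.double` line followed by unlabelled continuation lines, in the same style as the example above.

```python
def format_vector(label, values, per_line=4, directive=".double"):
    """Return data-segment lines with at most per_line values per directive."""
    lines = []
    for start in range(0, len(values), per_line):
        chunk = ", ".join(str(v) for v in values[start:start + per_line])
        # Only the first line carries the label; the others continue the vector.
        prefix = f"{label}:" if start == 0 else ""
        lines.append(f"{prefix:<8} {directive} {chunk}")
    return lines


# HYPOTHETICAL sample data: 1.0, 2.0, ..., 32.0
v1_values = [float(i) for i in range(1, 33)]
for line in format_vector("v1", v1_values):
    print(line)
```

The printed lines can be pasted into the `.data` section of the source file.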

* (a) Calculate the CPU time of the above program with the CPU performance equation, assuming a clock frequency of 15 MHz:

  $$\text{CPU time} = \left(\sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i\right) \times \text{clock cycle period}$$

  By definition:
  * CPI*i* is the number of clock cycles required by the related functional unit to execute the instruction (EX stage).
  * IC*i* is the number of times an instruction is repeated in the referenced source code.
* (b) Recalculate the CPU time assuming that you can triple the speed of just one unit of your choice, either the FP multiplier or the FP divider:
  * FP multiplier unit: 6 → 2 clock cycles, *or*
  * FP divider unit: 30 → 10 clock cycles

Table 1: CPU time by hand

|  |  |  |  |
| --- | --- | --- | --- |
|  | Initial CPU time (a) | CPU time  (b – MUL sped up) | CPU time  (b – DIV sped up) |
| program\_1.s |  |  |  |
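The by-hand calculation is a weighted sum of instruction counts, divided by the clock frequency. The sketch below uses the EX latencies of the Initial Configuration; the FP operation counts per iteration (3 subtractions, 2 multiplications, 1 division) are read off the high-level code, while the 6 memory accesses and the 2 integer instructions of loop overhead are assumptions about one possible implementation. Instructions outside the loop are ignored, so replace the counts with those of your own program.

```python
CLOCK_HZ = 15e6   # 15 MHz, from the assignment
ITERATIONS = 32

# Clock cycles spent in EX per instruction class (Initial Configuration).
cpi = {"int": 1, "mem": 1, "fp_add": 4, "fp_mul": 6, "fp_div": 30}

# Instructions per loop iteration. "int" (index update and branch) and
# "mem" (3 loads, 3 stores) are HYPOTHETICAL: count them in your program.
per_iteration = {"int": 2, "mem": 6, "fp_add": 3, "fp_mul": 2, "fp_div": 1}


def cpu_time(cpi_table):
    """Return (total clock cycles, CPU time in seconds) for the loop."""
    cycles = ITERATIONS * sum(cpi_table[k] * n for k, n in per_iteration.items())
    return cycles, cycles / CLOCK_HZ


scenarios = [
    ("initial", {}),
    ("MUL sped up", {"fp_mul": 2}),
    ("DIV sped up", {"fp_div": 10}),
]
for label, override in scenarios:
    cycles, seconds = cpu_time({**cpi, **override})
    print(f"{label}: {cycles} cycles, {seconds * 1e6:.2f} us")
```

Because this model simply adds EX latencies, it accounts for neither pipeline overlap nor stalls.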

* (c) Using the simulator, calculate the CPU time again and fill in the following table:

Table 2: CPU time using the simulator

|  |  |  |  |
| --- | --- | --- | --- |
|  | Initial CPU time (a) | CPU time  (b – MUL sped up) | CPU time  (b – DIV sped up) |
| program\_1.s |  |  |  |

Are there any differences? If so, where and why? If not, please provide some comments in the box below:

|  |
| --- |
| Your answer: |

* (d) Using the simulator and the *Initial Configuration*, enable the Forwarding option and compute how many clock cycles the program takes to execute.

Table 3: Forwarding enabled

|  |  |  |
| --- | --- | --- |
|  | Number of clock cycles | IPC (Instructions Per Clock) |
| program\_1.s |  |  |
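IPC and CPU time both follow directly from the cycle and instruction counts reported by the simulator. The helper below is a small sketch: the name `summarize` and the sample counts are hypothetical, and the 15 MHz clock is the one given earlier in this assignment.

```python
CLOCK_HZ = 15e6  # 15 MHz, from the assignment


def summarize(cycles, instructions):
    """Return (IPC, CPI, CPU time in seconds) from simulator statistics."""
    ipc = instructions / cycles
    return ipc, 1 / ipc, cycles / CLOCK_HZ


# HYPOTHETICAL counts: replace them with the values shown by the simulator.
ipc, cpi, seconds = summarize(cycles=2000, instructions=500)
print(f"IPC = {ipc:.3f}, CPI = {cpi:.3f}, CPU time = {seconds * 1e6:.2f} us")
```

The same function can be applied to each configuration when filling in the tables that ask for IPC and clock cycles.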

Enable the *optimization features* that were initially disabled, one at a time, and collect the statistics needed to fill in the following table. Complete all the required data before exporting this file to PDF for delivery.

Table 4: **Program performance for different processor configurations**

|  |  |  |  |  |  |  |  |  |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Program | Forwarding | | Branch Target Buffer | | Delay Slot | | Forwarding + Branch Target Buffer | |
|  | IPC | CC | IPC | CC | IPC | CC | IPC | CC |
| program\_1.s |  |  |  | |  | |  | |

2. Using the WinMIPS64 simulator, experimentally validate Amdahl’s law, defined as follows:

$$\text{speedup}_{\text{overall}} = \frac{\text{execution time}_{\text{old}}}{\text{execution time}_{\text{new}}} = \frac{1}{(1 - \text{fraction}_{\text{enhanced}}) + \dfrac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}}}$$

1. Use the program developed before: **program\_1.s**
2. Modify the processor architectural parameters related to multicycle instructions (Menu → Configure → Architecture) as follows:
   * Configuration 1: starting from the *Initial Configuration*, change the FP addition latency to 3 clock cycles.
   * Configuration 2: starting from the *Initial Configuration*, change the FP multiplier latency to 4 clock cycles.
   * Configuration 3: starting from the *Initial Configuration*, change the FP division latency to 10 clock cycles.

Compute the speed-up for each of the previous processor configurations, both manually (using Amdahl’s law) and with the simulator. Compare the results obtained and complete the following table.

Table 5: **program\_1.s speed-up computed by hand and by simulation**

|  |  |  |  |  |
| --- | --- | --- | --- | --- |
| **Proc. Config.**    **Speed-up comp.** | Initial config.  [c.c.] | Config. 1 | Config. 2 | Config. 3 |
| **By hand** |  |  |  |  |
| **By simulation** |  |  |  |  |
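As an illustration of the comparison, the sketch below applies Amdahl's law to Configuration 3 (FP division latency from 30 to 10 clock cycles). The 32 divisions at 30 cycles each follow from the loop; the total by-hand cycle count and both simulated cycle counts are hypothetical placeholders, to be replaced with your own figures.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speed-up when a fraction of the time is sped up by a factor."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# By hand: fraction of the execution time spent in the enhanced unit.
total_cycles = 1984        # HYPOTHETICAL by-hand total for the Initial Configuration
div_cycles = 32 * 30       # 32 divisions at 30 clock cycles each
fraction = div_cycles / total_cycles
by_hand = amdahl_speedup(fraction, speedup_enhanced=30 / 10)

# By simulation: ratio of the measured cycle counts (HYPOTHETICAL values).
sim_initial_cycles = 2200
sim_config3_cycles = 1560
by_simulation = sim_initial_cycles / sim_config3_cycles

print(f"fraction enhanced = {fraction:.3f}")
print(f"speed-up by hand = {by_hand:.3f}")
print(f"speed-up by simulation = {by_simulation:.3f}")
```

The two figures generally differ, since the by-hand fraction ignores the stalls and overlaps that the simulator takes into account.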