# Homework 3 (Vivado HLS) Report

## Introduction

This report presents an overview on the design optimization of 8 loop kernels as part of the Advanced Computer Architecture course (CS-470) at EPFL. It highlights the analysis of each kernel's naïve implementation followed by the optimized implementation. The explanations and comparative results (in terms of area and timing) for all the optimizations are also presented. Please note that the results are given in terms of clock cycles assuming a clock period of 10ns, and a summary of results can be found in the conclusion at the end.

# Kernel-1

The code for the loop kernel is shown below. The same code was used for the optimized version too.

```
1 #include "kernel1.h"
2
3 void kernel1( int array[ARRAY_SIZE] )
4 {
5    int i;
6    loop:for(i=0; i<ARRAY_SIZE; i++)
7         array[i] = array[i] * 5;
8 }
9</pre>
```

### **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive.
- 2. Each iteration is independent since it uses a different index i.
- 3. Hence, it was possible to achieve an initiation interval of 1.

#### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The optimized version has the same trip count and iteration latency since the code was unchanged. However, the loop latency is 2 times better because of pipelining, which led to an improvement in the total latency from 2049 to 1026 cycles.

| Implementation | Total L | atency | Loop L | atency | Iteration | Trin Count | Initiation |
|----------------|---------|--------|--------|--------|-----------|------------|------------|
|                | Min     | Max    | Min    | Max    | Latency   | Trip Count | Interval   |
| Naïve          | 2049    | 2049   | 2048   | 2048   | 2         | 1024       | -          |
| Optimized      | 1026    | 1026   | 1024   | 1024   | 2         | 1024       | 1          |

**Utilization Estimates:** The optimized version uses 12 (10%) more LUTs but 8 (23%) less FFs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAN | AM_18K DSF |      | SP48E |      | FF |      | JT | URAM |   |
|----------------|------|------------|------|-------|------|----|------|----|------|---|
|                | Unit | %          | Unit | %     | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0    | 0          | 0    | 0     | 35   | ~0 | 115  | ~0 | 0    | 0 |
| Optimized      | 0    | 0          | 0    | 0     | 27   | ~0 | 127  | ~0 | 0    | 0 |

The code for the loop kernel is shown below. The code was rewritten for the optimized version.

```
#include "kernel2.h"

void kernel2( int array[ARRAY_SIZE] )

{
   int i;
   loop:for(i=3; i<ARRAY_SIZE; i++)
        array[i] = array[i-1] + array[i-2] * array[i-3];

}
</pre>
```

The optimized version of the code is shown below.

```
1 #include "kernel2.h"
 2
 3 void kernel2( int array[ARRAY_SIZE] )
 4 {
 5
       int prev[3] = {array[0], array[1], array[2]};
 6
       int result;
 7
 8
       loop:for(int i=3; i<ARRAY_SIZE; i++)</pre>
 9
10
           result = prev[2] + prev[1] * prev[0];
11
12
           prev[0] = prev[1];
13
           prev[1] = prev[2];
14
           prev[2] = result;
15
           array[i] = result;
16
17
       }
18 }
19
```

#### **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. The code was rewritten to reduce memory reads to zero for each iteration. This was done by using registers to save the previous three results which can be used instead of *array[i-1]*, *array[i-2]* and *array[i-3]* reads in every iteration. Finally, the result was written into *array[i]*.
- 3. Hence, it was possible to achieve an initiation interval of 1.

#### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The optimized version has the same trip count but a better iteration latency after rewriting the code. The loop latency is 5 times better because of pipelining, which led to an improvement in the total latency from 5106 to 1025 cycles.

| Implementation | Total L | atency | Loop L | atency | Iteration | Trin Count | Initiation |
|----------------|---------|--------|--------|--------|-----------|------------|------------|
| implementation | Min     | Max    | Min    | Max    | Latency   | Trip Count | Interval   |
| Naïve          | 5106    | 5106   | 5105   | 5105   | 5         | 1021       | -          |
| Optimized      | 1025    | 1025   | 1021   | 1021   | 2         | 1021       | 1          |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es, but 45 (31%) more FFs and 31 (14%) more LUTs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAN | RAM_18K D |      | DSP48E |      | FF |      | JT | URAM |   |
|----------------|------|-----------|------|--------|------|----|------|----|------|---|
|                | Unit | %         | Unit | %      | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0    | 0         | 3    | ~0     | 145  | ~0 | 222  | ~0 | 0    | 0 |
| Optimized      | 0    | 0         | 3    | 1      | 190  | ~0 | 253  | ~0 | 0    | 0 |

# Kernel-3

The code for the loop kernel is shown below. The same code was used for the optimized version too.

```
#include "kernel3.h"

void kernel3( float hist[ARRAY_SIZE], float weight[ARRAY_SIZE], int index[ARRAY_SIZE])

{
    loop:for (int i=0; i<ARRAY_SIZE; ++i) {
        hist[index[i]] = hist[index[i]]+ weight[i];
    }

}</pre>
```

## **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. The iterations are not independent. There may be a RAW data dependency on the *hist* array in the case where at least 2 *index[i]* in the subsequent 7 iterations have the same value.
- 3. A perfect initiation interval could not be achieved because of the above unpredictability. It cannot be ignored, but we can use a complex logic to check the subsequent 7 *index[i]* values and accumulate the *weight* if necessary. But it will be very expensive in terms of area.
- 4. Hence, it was only possible to achieve an initiation interval of 7.

### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The optimized version has the same trip count and iteration latency since the code was unchanged. However, the loop latency is a little better because of pipelining, which led to an improvement in the total latency from 8193 to 7170 cycles.

| Implementation | Total L | atency | Loop L | atency | Iteration | Trip Count | Initiation |
|----------------|---------|--------|--------|--------|-----------|------------|------------|
| implementation | Min     | Max    | Min    | Max    | Latency   | Trip Count | Interval   |
| Naïve          | 8193    | 8193   | 8192   | 8192   | 8         | 1024       | -          |
| Optimized      | 7170    | 7170   | 7168   | 7168   | 8         | 1024       | 7          |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es but 3 (1%) more FFs and 24 (7.6%) more LUTs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAN | BRAM_18K |      | DSP48E |      | FF |      | JT | URAM |   |
|----------------|------|----------|------|--------|------|----|------|----|------|---|
|                | Unit | %        | Unit | %      | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0    | 0        | 2    | ~0     | 364  | ~0 | 316  | ~0 | 0    | 0 |
| Optimized      | 0    | 0        | 2    | ~0     | 367  | ~0 | 340  | ~0 | 0    | 0 |

The code for the loop kernel is shown below. The code was rewritten for the optimized version.

```
#include "kernel4.h"

void kernel4(int array[ARRAY_SIZE], int index[ARRAY_SIZE], int offset)

{
    loop:for (int i=offset+1; i<ARRAY_SIZE-1; ++i)
    {
        array[offset] = array[offset]-index[i]*array[i]+index[i]*array[i+1];
    }
}</pre>
```

The optimized version of the code is shown below.

```
1 #include "kernel4.h"
 3evoid kernel4(int array[ARRAY_SIZE], int index[ARRAY_SIZE], int offset)
 4 {
 5
       int sum = 0;
 6
       loop:for (int i=offset+1; i<ARRAY_SIZE-1; ++i)</pre>
 7
 8
       {
 9
           sum = sum + index[i] * (array[i+1] - array[i]);
10
       }
11
12
       array[offset] = array[offset] + sum;
13 }
14
```

## **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. The code was rewritten to avoid the RAW dependency which may arise from writing into array[offset] and reading array[i] or array[i+1] at the same time.
- 3. Further, the final result just needs to be written in one location, hence, it can be pulled out of the loop to reduce memory operations in the loop. The value is accumulated in *sum* and finally added to *array[offset]* in the end.
- 4. Hence, it was possible to achieve an initiation interval of 1.

#### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The code has variable loop boundaries; hence, we assume a trip count of N (range: 1 to 1022) to calculate the loop latency (5N: naïve, N+2: optimized) and total latency (5N+1: naïve, N+5: optimized) based on the synthesis schedule. The optimized version has the same trip count but a better iteration latency after rewriting the code. The loop latency is 5 times better because of pipelining, which led to an improvement in the total max latency from 5111 to 1027 cycles.

| Implementation | Total Latency |      | Loop L | atency | Iteration | Trin Count | Initiation |
|----------------|---------------|------|--------|--------|-----------|------------|------------|
|                | Min           | Max  | Min    | Max    | Latency   | Trip Count | Interval   |
| Naïve          | 6             | 5111 | 5      | 5110   | 5         | N          | -          |
| Optimized      | 6             | 1027 | 3      | 1024   | 3         | N          | 1          |

**Utilization Estimates:** The optimized version uses 3 (50%) less DSP48Es and 93 (34%) less FFs, but 69 (25%) more LUTs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAN | 1_18K | DSP  | 48E | FF   |    | LUT  |    | URAM |   |
|----------------|------|-------|------|-----|------|----|------|----|------|---|
|                | Unit | %     | Unit | %   | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0    | 0     | 6    | 1   | 272  | ~0 | 279  | ~0 | 0    | 0 |
| Optimized      | 0    | 0     | 3    | ~0  | 179  | ~0 | 348  | ~0 | 0    | 0 |

# Kernel-5

The code for the loop kernel is shown below. The code was rewritten for the optimized version.

```
1 #include "kernel5.h"
 3 float kernel5(float bound, float a[ARRAY_SIZE], float b[ARRAY_SIZE])
 4 {
 5
       int i=0;
 6
       float sum;
 7
       loop:while (sum<bound && i<ARRAY_SIZE)</pre>
 8
 9
            sum = a[i] + b[i];
10
           i++;
11
       }
12
       return sum;
13 }
14
```

The optimized version of the code is shown below.

```
1 #include "kernel5.h"
 2 #define len 8
 40 float kernel5(float bound, float a[ARRAY_SIZE], float b[ARRAY_SIZE])
 6
       float sum[len];
 7
       bool flag[len];
 8
       fill:for (int i=0; i<len; i++)
 9
10
           int idx = len-i-1;
11
           sum[idx] = a[i] + b[i];
12
           flag[idx] = (sum[idx] >= bound);
13
       }
14
15
       loop:for (int i=len; i<ARRAY SIZE; i++)</pre>
16
       {
17
           if (flag[len-1])
18
                break:
19
           shift1:for (int j=len-1; j>0; j--)
20
21
                sum[j] = sum[j-1];
22
23
           shift2:for (int j=len-1; j>2; j--)
24
                flag[j] = flag[j-1];
25
           sum[0] = a[i] + b[i];
26
27
           flag[2] = (sum[2] >= bound);
28
       }
29
30
       return sum[len-1];
31 }
```

## **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. The code was rewritten to remove the control dependency on *sum* which decides the exit condition of the loop. This was done using a shift queue of length 8 to store the *sum* at the end, and check the exit condition using the oldest value which is ready. The shifting loop was unrolled using the unroll directive and *sum* was partitioned using the partition directive.
- 3. Next, another queue was used to store the exit condition *flag* and this computation runs a little behind the *sum* computation (3 cycles: based on the comparison latency), effectively using a 6-length queue. The shifting loop was unrolled and *flag* was partitioned, like *sum*.
- 4. Another loop was added before the main loop to initialize the shift queues for *sum* and *flag*. This loop was pipelined using the pipeline directive, achieving an iteration interval of 1.
- 5. Finally, the intra-loop dependence for *flag* was explicitly set to true using the dependence directive so that it is always executed in sequence within an iteration.
- 6. Hence, it was possible to achieve an initiation interval of 1.

## Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The code has variable loop boundaries; hence, we assume a trip count of N (range: 1 to 1024) to calculate the loop latency (7N: naïve) and total latency (7N+1: naïve) based on the synthesis schedule. The estimates for the optimized version are provided by Vivado HLS, as shown below, for the main loop and the *fill* loop.

The optimized version has similar trip count and iteration latency, in different conditions. However, the loop latency is 7 times better because of pipelining and re-design, which led to an improvement in the max total latency from 7168 to 1039 cycles.

| luculous autation | Total L | atency | Loop L | atency | Iteration | Trin Count         | Initiation |
|-------------------|---------|--------|--------|--------|-----------|--------------------|------------|
| Implementation    | Min     | Max    | Min    | Max    | Latency   | Latency Trip Count |            |
| Naïve             | 8       | 7168   | 7      | 7168   | 7         | N                  | -          |
| Optimized         | 24      | 1039   | 6      | 1021   | 7         | 1~1016             | 1          |
| (fill loop)       | -       | -      | 14     | 14     | 8         | 8                  | 1          |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es, but 1037 (225%) more FFs and 530 (110%) more LUTs compared to the naïve version. The performance improvement came at a high cost in area.

| Implementation | BRAN | 1_18K | DSP48E |    | FF   |    | LUT  |    | URAM |   |
|----------------|------|-------|--------|----|------|----|------|----|------|---|
|                | Unit | %     | Unit   | %  | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0    | 0     | 2      | ~0 | 462  | ~0 | 478  | ~0 | 0    | 0 |
| Optimized      | 0    | 0     | 2      | ~0 | 1499 | ~0 | 1008 | ~0 | 0    | 0 |

The code for the loop kernel is shown below. The code was rewritten for an alternate version.

```
#include "kernel6.h"

int kernel6(int x)

int i=0;

loop:while(i*i < x)

i++;

return i;

}</pre>
```

The alternate version of the code is shown below.

```
1 #include "kernel6.h"
 3 int kernel6(int x)
 4 {
 5
       int i=0;
       loop:for (i=0; ; i++)
 6
 7
           if (i*i >= x)
 8
 9
                break;
10
       }
11
12
       return i;
13 }
14
```

#### **Optimization Steps**

The naïve implementation performs just fine (see table). The following observations and changes led to an alternate design:

- 1. The naïve implementation is decent because the loop has an iteration latency of just 1.
- 2. The loop can be pipelined using the pipeline directive, but this is not really necessary.
- 3. The code was rewritten in an attempt to apply directives and remove the control dependency. But even without it, the performance is optimal because of cheap operations within the loop.
- 4. Hence, it was possible to achieve an initiation interval of 1.

#### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The code has variable loop boundaries; hence, we assume a trip count of N (infinite range) to calculate the loop latency (N: naïve, N: optimized) and total latency (N+1: naïve, N+1: optimized) based on the synthesis schedule. Both the versions are similar in design and they give the same performance.

| Implementation | Total L | atency | Loop L | atency | Iteration | Trip Count | Initiation |
|----------------|---------|--------|--------|--------|-----------|------------|------------|
| implementation | Min     | Max    | Min    | Max    | Latency   | Trip Count | Interval   |
| Naïve          | 1       | (N+1)  | 0      | (N)    | 1         | N          | -          |
| Optimized      | 1       | (N+1)  | 0      | (N)    | 1         | N          | 1          |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es, 1 (3%) more FF and 6 (7%) more LUTs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAN | BRAM_18K |      | DSP48E |      | FF |      | LUT |      | URAM |  |
|----------------|------|----------|------|--------|------|----|------|-----|------|------|--|
|                | Unit | %        | Unit | %      | Unit | %  | Unit | %   | Unit | %    |  |
| Naïve          | 0    | 0        | 3    | ~0     | 34   | ~0 | 102  | ~0  | 0    | 0    |  |
| Optimized      | 0    | 0        | 3    | ~0     | 35   | ~0 | 108  | ~0  | 0    | 0    |  |

The code for the loop kernel is shown below. The same code was used for the optimized version too.

```
1 #include "kernel7.h"
 30 float kernel7(float a[ARRAY_SIZE], float b[ARRAY_SIZE])
 4 {
 5
       float sum = 0;
       loop:for(int i=0; i<ARRAY_SIZE; i++)</pre>
 6
 7
 8
            float diff = a[i] - b[i];
 9
            if (diff > 0)
10
                sum = (sum + diff);
11
12
       return sum;
13 }
14
```

## **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. There is a control dependency on *diff* followed by a data dependency on *sum* which stalls the pipeline. This prevents from achieving a perfect initiation interval.
- 3. The problem cannot be solved by simple predication as it will involve a float multiplication with the predicate, which is a costly operation, and hence, worsens the performance.
- 4. It could be possible to improve the performance at a high cost of area by using a queue to store the *diff*. It will require complex logic, an inefficient loop to fill the queue and an unrolled loop to shift the queue items. But it will still have the control dependency due to the top item.
- 5. Hence, it was only possible to achieve an initiation interval of 4.

#### Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The optimized version has the same trip count and iteration latency since the code was unchanged. However, the loop latency is 2.5 times better because of pipelining, which led to an improvement in the total latency from 10241 to 4104 cycles.

| Implementation | Total Latency |       | Loop L | atency | Iteration | Trip Count | Initiation |  |
|----------------|---------------|-------|--------|--------|-----------|------------|------------|--|
|                | Min           | Max   | Min    | Max    | Latency   | Trip Count | Interval   |  |
| Naïve          | 10241         | 10241 | 10240  | 10240  | 10        | 1024       | -          |  |
| Optimized      | 4104          | 4104  | 4102   | 4102   | 11        | 1024       | 4          |  |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es, 63 (14%) more FFs and 41 (8%) more LUTs compared to the naïve version. There is only a minor difference in area.

| Implementation | BRAM_18K |   | DSP48E |    | FF   |    | LUT  |    | URAM |   |
|----------------|----------|---|--------|----|------|----|------|----|------|---|
|                | Unit     | % | Unit   | %  | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0        | 0 | 2      | ~0 | 457  | ~0 | 500  | ~0 | 0    | 0 |
| Optimized      | 0        | 0 | 2      | ~0 | 520  | ~0 | 541  | ~0 | 0    | 0 |

The code for the loop kernel is shown below. The code was rewritten for the optimized version.

```
#include "kernel8.h"

void kernel8(int array[ARRAY_SIZE], int multiplier, int offset)

{
    loop:for (int i=6; i<ARRAY_SIZE-1-offset; ++i)
    {
        array[i] = array[i-6+offset]*multiplier;
    }
}</pre>
```

The optimized version of the code is shown below.

```
1 #include "kernel8.h"
 2
 3@void kernel8(int array[ARRAY_SIZE], int multiplier, int offset)
 4 {
 5
       int tmp[4] = {array[2], array[3], array[4], array[5]};
       int flag = (offset > 2) & (offset < 5);</pre>
 6
 7
       int result, rd_val;
 8
       loop:for (int i=6; i<ARRAY_SIZE-1-offset; i++)</pre>
 9
10
11
            if (flag)
12
                rd_val = tmp[offset-2];
13
            else
14
                rd_val = array[i-6+offset];
15
16
            tmp[0] = tmp[1];
17
            tmp[1] = tmp[2];
18
            tmp[2] = tmp[3];
19
            tmp[3] = rd_val * multiplier;
20
21
            array[i] = tmp[3];
22
       }
23 }
24
```

#### **Optimization Steps**

The naïve implementation does not give very good results (see table). The following observations and changes led to an optimized design:

- 1. The loop can be pipelined using the pipeline directive. But this is not sufficient.
- 2. The code was rewritten to remove the inter-loop dependency. The calculation and memory write requires 4 cycles, and hence, the dependency on *array[i]* read can occur only from a previous *array[i-1]* to *array[i-4]* write. Hence, these previous results are saved in registers.
- 3. The registers are used conditionally: if 1 < 6 offset < 4, i.e., 2 < offset < 5, which can be detected beforehand. In this limited case, the limited number of extra registers can be used. Otherwise, it is safe to directly access the array freely: inter dependence was thus disabled.
- 4. Hence, it was possible to achieve an initiation interval of 1.

## Synthesis Comparison

The synthesis reports for both the implementations can be compared in terms of timing and area:

**Performance Estimates:** The code has variable loop boundaries; hence, we assume a trip count of N (range: 1 to 1017) to calculate the loop latency (4N: naïve, N+3: optimized) and total latency (4N+1: naïve, N+6: optimized) based on the synthesis schedule. The optimized version has the same trip count and iteration latency. However, the loop latency is 4 times because of pipelining and re-design, which led to an improvement in the max total latency from 4069 to 1023 cycles.

| Implementation | Total Latency |      | Loop L | atency | Iteration | Trin Count | Initiation |  |
|----------------|---------------|------|--------|--------|-----------|------------|------------|--|
|                | Min           | Max  | Min    | Max    | Latency   | Trip Count | Interval   |  |
| Naïve          | 5             | 4069 | 4      | 4068   | 4         | N          | -          |  |
| Optimized      | 7             | 1023 | 4      | 1020   | 4         | N          | 1          |  |

**Utilization Estimates:** The optimized version uses the same number of DSP48Es, but 359 (272%) more FF and 339 (135%) more LUTs compared to the naïve version. The performance improvement came at a high cost in area.

| Implementation | BRAM_18K |   | DSP48E |    | FF   |    | LUT  |    | URAM |   |
|----------------|----------|---|--------|----|------|----|------|----|------|---|
|                | Unit     | % | Unit   | %  | Unit | %  | Unit | %  | Unit | % |
| Naïve          | 0        | 0 | 3      | ~0 | 132  | ~0 | 250  | ~0 | 0    | 0 |
| Optimized      | 0        | 0 | 3      | ~0 | 491  | ~0 | 589  | ~0 | 0    | 0 |

## Conclusion

Finally, the design optimizations led to a perfect performance improvement with II=1 in most kernels (1, 2, 4, 5, 8), and a little performance improvement with II > 1 in other kernels (3, 7). Only kernel (6) performed good enough without any optimizations in the naïve implementation because of a 1-cycle latency. All the improvements, except for kernels (5, 8), were achieved with only a minor cost in area.

Please note that all the optimized kernels have been simulated and synthesized successfully. The optimizations were mostly achieved by rewriting the code logic, using the pipeline directive and occasionally, using the dependence directive. The unroll directive, and consequently, the partition directive were used sometimes to implement shift-register logic. The use of these directives could also improve the performance for some *kernels* (3, 7), but it will require very complex logic and very high cost in area. Hence, this approach was avoided.

Along with this report, the project files, naïve and optimized, are attached as two separate zip-files. The C++ files and HTML synthesis reports, naïve and optimized, are also attached separately.

Submitted by: Kushagra Shah (316002)

Dated: 15/05/2021