# Report: SoC lab4-2

### 1. Introduction:

- There are two BRAMs in the current design: one stores the coefficients, and the other stores the X values and the ap control values.
- The data flow, in brief, is as follows:

Once one X has been read, generation of the corresponding Y can start. No further X is accepted until that Y has been computed.
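The one-X-in, one-Y-out flow described above can be sketched as a small behavioural model. This is only an illustration in Python, not the RTL: the function name `fir_stream` and the list-based `window` are hypothetical, and the tap count of 11 is taken from the discussion later in this report.

```python
TAPS = 11  # tap count taken from this report (11 X values per Y)


def fir_stream(coeffs, xs):
    """Yield one Y per accepted X; the next X is not taken until Y is done."""
    assert len(coeffs) == TAPS
    window = [0] * TAPS  # newest X first; stands in for the X storage
    for x in xs:
        # Accept exactly one new X and drop the oldest one.
        window = [x] + window[:-1]
        acc = 0
        for c, w in zip(coeffs, window):  # 11 multiply-accumulates per Y
            acc += c * w
        # Y is complete here; only after this point is the next X read.
        yield acc


if __name__ == "__main__":
    # Made-up coefficients, used only to exercise the model.
    demo_coeffs = list(range(1, TAPS + 1))
    print(list(fir_stream(demo_coeffs, [1, 2, 3])))
```

Because the generator does not pull the next X until the current Y has been yielded, it mirrors the stall on X described above.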

Now let's turn to the interface between the firmware and the accelerator. The firmware sends the X data and the coefficients to the hardware. Before sending an X value, the firmware has to make sure the accelerator is able to accept it; likewise, before reading a Y output, it has to make sure the output is available. To check this, I created two registers, fir\_begin\_send\_x and fir\_able\_receive\_y.
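A minimal sketch of this polling handshake follows, written as a Python model rather than the actual firmware. The class `MockFir`, the function `run_firmware`, the flag polarity (1 means ready), and the rule that reading Y re-arms the X flag are all assumptions made for illustration; only the two register names come from this report.

```python
class MockFir:
    """Hypothetical stand-in for the accelerator and its two status flags."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.window = [0] * len(self.coeffs)
        self.fir_begin_send_x = 1    # assumed: 1 means an X can be accepted
        self.fir_able_receive_y = 0  # assumed: 1 means a finished Y is waiting
        self._y = 0

    def write_x(self, x):
        assert self.fir_begin_send_x == 1, "X written while accelerator is busy"
        self.window = [x] + self.window[:-1]
        self._y = sum(c * w for c, w in zip(self.coeffs, self.window))
        self.fir_begin_send_x = 0    # block further X until Y is collected
        self.fir_able_receive_y = 1

    def read_y(self):
        assert self.fir_able_receive_y == 1, "Y read before it was ready"
        self.fir_able_receive_y = 0
        self.fir_begin_send_x = 1    # assumed: collecting Y re-arms the X side
        return self._y


def run_firmware(fir, xs):
    """Firmware-side loop: poll before every X write and every Y read."""
    ys = []
    for x in xs:
        while not fir.fir_begin_send_x:
            pass  # busy-wait until the accelerator can take X
        fir.write_x(x)
        while not fir.fir_able_receive_y:
            pass  # busy-wait until Y is available
        ys.append(fir.read_y())
    return ys
```

In this mock the flags flip immediately, so the polling loops never actually spin; on real hardware each loop would wait for the accelerator to update the corresponding register.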

### 2. Area and timing report:



| Setup                        |          | Hold                         |          | Pulse Width                              |          |
|------------------------------|----------|------------------------------|----------|------------------------------------------|----------|
| Worst Negative Slack (WNS):  | 0.938 ns | Worst Hold Slack (WHS):      | 0.139 ns | Worst Pulse Width Slack (WPWS):          | 0.250 ns |
| Total Negative Slack (TNS):  | 0.000 ns | Total Hold Slack (THS):      | 0.000 ns | Total Pulse Width Negative Slack (TPWS): | 0.000 ns |
| Number of Failing Endpoints: | 0        | Number of Failing Endpoints: | 0        | Number of Failing Endpoints:             | 0        |
| Total Number of Endpoints:   | 1098     | Total Number of Endpoints:   | 1098     | Total Number of Endpoints:               | 345      |

### 3. Waveform and analysis:

#### ap\_start:



#### Xin:



#### Yout:



4. Let N be the number of Yout values we want to produce. Each Y needs 11 X values, so the total number of Xin reads is approximately 11N. The catch in this design is that the BRAM holding the 11 X values needed for one Y has room for only 10 of them, because one entry of that BRAM block is used to store the ap signals. We therefore have to stall until the current Y has been produced. The latency is a bit longer than that of a design that uses bram12. The throughput is approximately 1 OP per cycle.
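As a rough check of these figures, assume one multiply-accumulate (one OP) per X read and a hypothetical stall of $s$ cycles per output. The symbol $s$ is introduced here only for illustration and was not measured in this report.

$$
\text{OPs} = 11N, \qquad
\text{cycles} \approx N\,(11 + s), \qquad
\text{throughput} \approx \frac{11N}{N\,(11 + s)} = \frac{11}{11 + s}\ \text{OP/cycle}
$$

When $s$ is small compared with 11, the throughput stays close to 1 OP per cycle, and each cycle of stall removed moves it closer to that bound.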

Another way to improve performance is to add buffers that prefetch X, which would reduce the stall time.

One more thing worth mentioning is that the firmware design can also matter for performance once it is integrated with the hardware accelerator. The handshake mechanism could also be improved to enhance performance.