# SDK and Host Tutorial
## Vivado SDK
In the HLS part, we generated the hardware IP and integrated it into the block design as programmable logic (PL). We now need software running on the processing system (PS) to configure and drive that IP.

We declare a driver instance `hw`, initialize it, and pass the IP the memory addresses of the input and output buffers.

```C++
XFpgaconvnet_ip hw;

// Initialize the driver instance for the IP
XFpgaconvnet_ip_Initialize(&hw, XPAR_FPGACONVNET_IP_0_DEVICE_ID);

// Set IO addresses
XFpgaconvnet_ip_Set_fpgaconvnet_in_0(&hw, (u32) input);
XFpgaconvnet_ip_Set_fpgaconvnet_out_0(&hw, (u32) output);
```

`std::cin` reads the input, which the host sends over UART. In the simple end-to-end example, the `coarse_in` parameter is set to 1, so each 64-bit input word carries only 16 bits of "meaningful" data and the remaining bits are `0`. Each value is therefore converted to the `u64` type, `AND`ed with a bit mask that keeps the low 16 bits, and stored in the `input` array.

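`Xil_DCacheFlush()` writes the cached data back to memory. Without the flush, part of the input may still sit in the processor's data cache, and the IP, which reads memory directly rather than through that cache, could see stale data.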

```C++
// Parse and load input featuremap from UART
for (int i = 0; i < INPUT_SHAPE; i++) {
    std::cin >> in;
    input[i] = ((u64) std::stoi(in)) & 0x000000000000ffff;
}

usleep(100000);

// Flush cache
Xil_DCacheFlush();
```
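The masking step can be tried off-target. The sketch below is a minimal illustration under stated assumptions: `pack_input_word` is a hypothetical helper that is not part of the tutorial sources, and `std::uint64_t` stands in for the Xilinx `u64` type. It also suggests why the mask is useful: converting a negative `int` to a 64-bit unsigned value sets all the upper bits, and the mask clears them again.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of the tutorial sources: parse one decimal
// token and keep only its low 16 bits inside a 64-bit word.
std::uint64_t pack_input_word(const std::string& token) {
    assert(!token.empty());
    const int value = std::stoi(token);
    // The int-to-unsigned conversion sign-extends negative values, so the
    // mask is what guarantees that bits 16..63 are zero.
    return static_cast<std::uint64_t>(value) & 0x000000000000ffffULL;
}
```

For instance, `pack_input_word("-1")` yields `0xffff` rather than a word with all 64 bits set.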

After the input is loaded, we run the IP `RUNS` times and measure the total running time. Averaging over several runs gives a more precise measurement than timing a single run.

```C++
XTime t_start, t_end;
XTime_GetTime(&t_start);

for(int i = 0; i < RUNS; i++) {
    XFpgaconvnet_ip_Start(&hw);

    // Wait for IP to finish
    while (!XFpgaconvnet_ip_IsReady(&hw));
}

XTime_GetTime(&t_end);
```

Before the output is dumped, we invalidate the data cache so that the processor reads the results the IP wrote to memory instead of stale cached copies.

The `coarse_out` parameter is 4, so every 64-bit word contains four unsigned 16-bit output values. Because memory is byte-addressed, each 16-bit value occupies two consecutive byte addresses, and output `i` starts `2 * i` bytes after `output`, the absolute address at which the output values begin. The program reads each value with one 16-bit load and sends it to the host over UART, the same channel that carried the input.

```C++
// Invalidate cache
Xil_DCacheInvalidate();

// Dump output
for (int i = 0; i < OUTPUT_SHAPE; i++) {
    // Memory is byte-addressed, so consecutive 16-bit values are 2 bytes apart
    std::cout << Xil_In16(((u64) output) + 2 * i) << " ";
    usleep(50);
}
std::cout << '\n';
```
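The address arithmetic in the dump loop can be checked on a desktop machine. The following is a sketch only: `read_u16` and `lane_of_word` are hypothetical names, `std::memcpy` stands in for `Xil_In16`, and the lane mapping assumes a little-endian processor, so that the value at the lowest byte address sits in the lowest 16 bits of each 64-bit word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read the i-th 16-bit value at byte offset 2 * i, as the dump loop does.
std::uint16_t read_u16(const void* base, std::size_t i) {
    assert(base != nullptr);
    std::uint16_t value;
    std::memcpy(&value, static_cast<const unsigned char*>(base) + 2 * i,
                sizeof value);
    return value;
}

// Equivalent view on a little-endian machine: output i lives in
// 64-bit word i / 4, in 16-bit lane i % 4 counted from the low bits.
std::uint16_t lane_of_word(const std::uint64_t* words, std::size_t i) {
    assert(words != nullptr);
    return static_cast<std::uint16_t>(words[i / 4] >> (16 * (i % 4)));
}
```

With `coarse_out` equal to 4, outputs 0 to 3 come from the first 64-bit word, outputs 4 to 7 from the second, and so on.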

Finally, we print the timing results and flush the cache again in preparation for subsequent runs.

```C++
// Total time for all runs, converted from timer counts to microseconds
float t_run = (t_end - t_start) * 1000000. / COUNTS_PER_SECOND;
std::cout << "Completed " << RUNS << " Runs.  Time taken (us): " << (int) t_run
          << "  Rate (img/s): " << 1 / t_run * RUNS * 1000000 << "\n";
Xil_DCacheFlush();
```
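The conversion from timer counts to a throughput figure can be sketched in isolation. `Timing` and `summarise` below are hypothetical names, and the counter frequency is passed in as a parameter because the real value of `COUNTS_PER_SECOND` depends on the platform.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: total time and the derived throughput.
struct Timing {
    double micros;             // total time for all runs, in microseconds
    double images_per_second;  // average inference rate
};

// Convert a pair of timer readings into the two figures printed above.
Timing summarise(std::uint64_t t_start, std::uint64_t t_end,
                 std::uint64_t counts_per_second, int runs) {
    assert(t_end > t_start && counts_per_second > 0 && runs > 0);
    const double seconds = static_cast<double>(t_end - t_start) /
                           static_cast<double>(counts_per_second);
    return {seconds * 1e6, runs / seconds};
}
```

As a worked example, 500000 counts at 1000000 counts per second over 10 runs is 0.5 s in total, which gives 500000 us and 20 img/s.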

## Host Code 
The host code extracts the input image (in the simple end-to-end example, from the MNIST dataset), runs inference on the FPGA, and compares the result with a software reference to verify the implementation.

First, the host program reads the index of the image selected for comparison from the command line. [`get_MNIST_image`](https://github.com/AlexMontgomerie/fpgaconvnet-tutorial/blob/main/tutorial/1_simple_end_to_end/hardware-tutorial-assets/host-code/tutorial_library.py#L7) grabs that image from the MNIST dataset and normalizes the data. The image is then [sent](https://github.com/AlexMontgomerie/fpgaconvnet-tutorial/blob/main/tutorial/1_simple_end_to_end/hardware-tutorial-assets/host-code/tutorial_library.py#L42) to the FPGA over UART. `mnist_image` is a 4D array whose dimensions are [Batch Size], [Channel], [Height], and [Width]. For a single MNIST image, the first two dimensions are both 1. The data is multiplied by 256 (2^8) to convert it to fixed point, which corresponds to 8 fractional bits.
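The scaling by 256 can be illustrated with a small conversion routine. This is a sketch under assumptions rather than the host library's code: the host side is Python, while the arithmetic is shown here in C++ to match the board-side snippets. The signed 16-bit container, round-to-nearest, and saturation are all assumptions, and `to_fixed` and `from_fixed` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Assumed format: signed 16-bit word with 8 fractional bits (scale 2^8 = 256).
constexpr float kScale = 256.0f;

// Scale, saturate to the 16-bit range, then round to the nearest integer.
std::int16_t to_fixed(float x) {
    assert(std::isfinite(x));
    float scaled = x * kScale;
    if (scaled > 32767.0f) scaled = 32767.0f;
    if (scaled < -32768.0f) scaled = -32768.0f;
    return static_cast<std::int16_t>(std::lround(scaled));
}

// Inverse mapping, useful when interpreting values read back from the board.
float from_fixed(std::int16_t q) {
    return static_cast<float>(q) / kScale;
}
```

Under these assumptions, a normalized pixel value of 0.5 becomes 128, and the smallest representable step is 1/256.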

After the FPGA processes the data, [`receive_string`](https://github.com/AlexMontgomerie/fpgaconvnet-tutorial/blob/main/tutorial/1_simple_end_to_end/hardware-tutorial-assets/host-code/tutorial_library.py#L77) receives the message, and [`receive_array`](https://github.com/AlexMontgomerie/fpgaconvnet-tutorial/blob/main/tutorial/1_simple_end_to_end/hardware-tutorial-assets/host-code/tutorial_library.py#L85) receives the output data. The output data is flattened for display.

For the software reference, the [`run_inference`](https://github.com/AlexMontgomerie/fpgaconvnet-tutorial/blob/main/tutorial/1_simple_end_to_end/hardware-tutorial-assets/host-code/tutorial_library.py#L37) function runs inference using `onnxruntime`, and the reference output data is flattened as well.

Unlike `mnist_image`, the 4D output of fpgaConvNet is ordered as [Batch Size], [Height], [Width], and [Channel], so the array must be transposed before the comparison. The two outputs are then compared by computing the mean squared error. A small error indicates that the hardware implementation agrees with the software reference.
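The layout change and the error metric come down to index arithmetic, sketched below. The host library performs this step in Python; the C++ version here is only an illustration, and `nhwc_to_nchw` and `mean_squared_error` are hypothetical names operating on flat, row-major buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reorder a flat [Batch][Height][Width][Channel] buffer into
// [Batch][Channel][Height][Width].
std::vector<float> nhwc_to_nchw(const std::vector<float>& src, std::size_t n,
                                std::size_t h, std::size_t w, std::size_t c) {
    assert(src.size() == n * h * w * c);
    std::vector<float> dst(src.size());
    for (std::size_t ni = 0; ni < n; ++ni)
        for (std::size_t hi = 0; hi < h; ++hi)
            for (std::size_t wi = 0; wi < w; ++wi)
                for (std::size_t ci = 0; ci < c; ++ci)
                    dst[((ni * c + ci) * h + hi) * w + wi] =
                        src[((ni * h + hi) * w + wi) * c + ci];
    return dst;
}

// Mean of the squared element-wise differences between two buffers.
double mean_squared_error(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size() && !a.empty());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return sum / static_cast<double>(a.size());
}
```

For a 1x1x2x2 buffer `{1, 2, 3, 4}` stored channel-last, the reordered buffer is `{1, 3, 2, 4}`: all values of channel 0 first, then all values of channel 1.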