## Convolution with Maxpool Demo Run
___
This notebook shows a single run of convolution followed by maxpool using the Darius IP. The input feature map is read from memory and processed, and the output feature map is captured, for a single convolution-with-maxpool command. The cycle count and efficiency of the full operation are read and displayed at the end.
The input data in memory is filled with random integers to test the run.

#### Terminology
| Term    | Description                              |
| :------ | :--------------------------------------- |
| IFM     | Input volume                             |
| Weights | A set of filter volumes                  |
| OFM     | Output volume                            |

#### Arguments
| Convolution Arguments | Description                              |
| --------------------- | ---------------------------------------- |
| ifm-h, ifm-w          | Height and width of an input feature map in an IFM volume |
| ifm-d                 | Depth of the IFM volume                  |
| kernel-h, kernel-w    | Height and width of the weight filters   |
| stride                | Stride for the IFM volume                |
| pad                   | Pad for the IFM volume                   |
| Channels              | Number of Weight sets/number of output feature maps |
| Pool-kernel-h, Pool-kernel-w | Height and width of maxpool kernel |
| Pool-stride           | Stride for the Convolved volume           |
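
As a quick sanity check on how these arguments interact, the sketch below applies the standard sliding-window output-size formula, `(in + 2*pad - kernel) // stride + 1`, first to the convolution and then to the maxpool stage. The helper name `output_size` is made up for this illustration and is not part of the Darius library. The sketch also assumes the pooling stage uses no padding.

```python
def output_size(in_size, kernel, stride, pad=0):
    """Standard output-size formula for a sliding window (convolution or pooling)."""
    return (in_size + 2 * pad - kernel) // stride + 1


# Values used in Step 1 of this notebook
conv_h = output_size(14, kernel=3, stride=1, pad=0)  # convolution output height
pool_h = output_size(conv_h, kernel=2, stride=2)     # maxpool output height

print(conv_h, pool_h)
```

With the arguments from Step 1 this gives a 12x12 convolved map and a 6x6 pooled map per channel, which appears consistent with the `\x0c` and `\x06` values visible in the command bytes printed in Step 6.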

### Block diagram

![](./images/darius_bd.png)
<center>Figure 1</center>

Figure 1 presents a simplified block diagram that includes the Darius CNN IP used for running convolution tasks. The Processing System (PS) represents the ARM processor and the external DDR. The Programmable Logic (PL) incorporates the Darius IP and an AXI Interconnect IP. AXI_GP_0 is an AXI-Lite interface that carries control signals between the ARM processor and the Darius IP. Data transfer happens over the AXI High Performance bus, denoted AXI_HP_0. __For more information about the Zynq architecture, see the__ [Zynq-7000 Technical Reference Manual (UG585)](https://www.xilinx.com/support/documentation/user_guides/ug585-Zynq-7000-TRM.pdf)


### Dataflow   

The dataflow begins by creating an input volume and a set of weights in Python local memory. In Figure 1, these volumes are denoted as `ifm_sw` and `weights_sw`, respectively. After populating them with random data, the ARM processor reshapes and copies these volumes into contiguous blocks of shared memory, represented as `ifm` and `weights` in Figure 1, using the `reshape_and_copy_ifm()` and `reshape_and_copy_weights()` functions. Once the data is accessible to the hardware, the PS starts the convolution by asserting the start bit of the Darius IP through the `AXI_GP_0` interface. Darius then reads the `ifm` and `weights` volumes from external memory and writes the results back to a pre-allocated location, shown as `ofm` in Figure 1.

Notes:
- We presume the data in `ifm_sw` and `weights_sw` is populated in row-major format. To get correct results, these volumes have to be reshaped into the interleaved format expected by the Darius IP.
- No data reformatting is required for subsequent convolution calls to Darius, because it produces the `ofm` volume in the same format in which it expects the `ifm` volume.
- Since the shared memory region is accessible from both the PS and the PL, any required post-processing can be performed directly on the `ofm` volume without transferring data back and forth to Python local memory.
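
The exact interleaved layout expected by Darius is not described in this notebook, so the following only illustrates what such a reshaping involves. It converts a flat row-major volume stored as depth x height x width into a layout where all depth values of one pixel sit next to each other (height x width x depth). The function name `to_depth_interleaved` and the target layout are assumptions for illustration, not the behavior of the Darius reshape functions.

```python
def to_depth_interleaved(flat, depth, height, width):
    """Reorder a flat (depth, height, width) row-major list into (height, width, depth).

    Hypothetical layout for illustration only; the real Darius format may differ.
    """
    assert len(flat) == depth * height * width
    out = []
    for h in range(height):
        for w in range(width):
            for d in range(depth):
                # Index of element (d, h, w) in the row-major source
                out.append(flat[(d * height + h) * width + w])
    return out


# Two 2x2 feature maps: channel 0 holds 0..3, channel 1 holds 10..13
volume = [0, 1, 2, 3, 10, 11, 12, 13]
print(to_depth_interleaved(volume, depth=2, height=2, width=2))
# -> [0, 10, 1, 11, 2, 12, 3, 13]
```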


### Step 1: Set the arguments for the convolution in CNNDataflow IP

In [1]:
# Input Feature Map (IFM) dimensions
ifm_height = 14
ifm_width = 14
ifm_depth = 64

# Kernel Window dimensions
kernel_height = 3
kernel_width = 3

# Other arguments
pad = 0
stride = 1

# Channels
channels = 32

# Maxpool dimensions
pool_kernel_height = 2
pool_kernel_width = 2
pool_stride = 2

print(
    "HOST CMD: CNNDataflow IP Arguments set are - IH %d, IW %d, ID %d, KH %d,"
    " KW %d, P %d, S %d, CH %d, PKH %d, PKW %d, PS %d"
    % (ifm_height, ifm_width, ifm_depth, kernel_height, kernel_width,
       pad, stride, channels, pool_kernel_height, pool_kernel_width, pool_stride))

HOST CMD: CNNDataflow IP Arguments set are - IH 14, IW 14, ID 64, KH 3, KW 3, P 0, S 1, CH 32, PKH 2, PKW 2, PS 2


### Step 2: Download `Darius Convolution IP` bitstream

In [2]:
from pynq import Overlay

overlay = Overlay(
    "/opt/python3.6/lib/python3.6/site-packages/pynq/overlays/darius/"
    "convolution.bit")
overlay.download()
print(f'Bitstream download status: {overlay.is_loaded()}')

Bitstream download status: True


### Step 3: Create MMIO object to access the CNNDataflow IP
For more on MMIO, see the [MMIO Documentation](http://pynq.readthedocs.io/en/latest/overlay_design_methodology/pspl_interface.html#mmio)

In [3]:
from pynq import MMIO

# Constants
CNNDATAFLOW_BASEADDR = 0x43C00000
NUM_COMMANDS_OFFSET = 0x60
CMD_BASEADDR_OFFSET = 0x70
CYCLE_COUNT_OFFSET = 0xd0

cnn = MMIO(CNNDATAFLOW_BASEADDR, 65536)
print(f'Idle state: {hex(cnn.read(0x0, 4))}')

Idle state: 0x4


### Step 4: Create Xlnk object
The Xlnk object (Memory Management Unit) allocates contiguous arrays in memory for data transfer between software and hardware.

<div class="alert alert-danger">Note: Allocation fails with a “failed to allocate memory” error once the buffers exhaust the contiguous memory pool, which is only 128 MB. Buffers from earlier runs stay allocated until they are freed, so call xlnk_reset() before re-allocating memory or re-running this cell.</div>

In [4]:
from pynq import Xlnk
import numpy as np

# Constant
SIZE = 5000000  # elements per buffer; 10 MB each when allocated as np.int16

mmu = Xlnk()

# Contiguous memory buffers for CNNDataflow IP convolution command, IFM Volume,
# Weights and OFM Volume. These buffers are shared memories that are used to 
# transfer data between software and hardware
cmd = mmu.cma_array(SIZE, dtype=np.int16)
ifm = mmu.cma_array(SIZE, dtype=np.int16)
weights = mmu.cma_array(SIZE, dtype=np.int16)
ofm = mmu.cma_array(SIZE, dtype=np.int16)

# Save the base physical address of the command, ifm, weights, and
# ofm buffers. These addresses are used later to copy and transfer data
# between hardware and software
cmd_baseaddr = cmd.physical_address
ifm_baseaddr = ifm.physical_address
weights_baseaddr = weights.physical_address
ofm_baseaddr = ofm.physical_address
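
To see how much of the contiguous pool these four buffers consume, the arithmetic can be written out directly. The sketch uses only values from the cell above (5,000,000 elements, 2 bytes per 16-bit integer, four buffers) and the 128 MB pool size quoted in the note. The variable names are illustrative.

```python
import struct

ELEMENTS = 5000000                          # SIZE from the cell above
BYTES_PER_ELEMENT = struct.calcsize("<h")   # 2 bytes for a 16-bit integer
NUM_BUFFERS = 4                             # cmd, ifm, weights, ofm

per_buffer = ELEMENTS * BYTES_PER_ELEMENT
per_run = per_buffer * NUM_BUFFERS
pool = 128 * 1024 * 1024                    # pool size quoted in the note above

print("Per buffer: %.1f MB" % (per_buffer / 1e6))
print("Per run of the cell: %.1f MB" % (per_run / 1e6))
print("Runs that fit without freeing:", pool // per_run)
```

With `np.int16` buffers, each run of the cell reserves about 40 MB, so only a few runs fit before the pool is exhausted.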

### Step 5: Functions to print Xlnk statistics

In [5]:
def get_kb(mmu):
    return int(mmu.cma_stats()['CMA Memory Available'] // 1024)


def get_bufcount(mmu):
    return int(mmu.cma_stats()['Buffer Count'])


def print_kb(mmu):
    print("Available Memory (KB): " + str(get_kb(mmu)))
    print("Available Buffers: " + str(get_bufcount(mmu)))


print_kb(mmu)

Available Memory (KB): 91136
Available Buffers: 4


### Step 6: Construct convolution command
Check that the arguments are in the supported range and construct the convolution command for the hardware.

In [6]:
from darius import darius_lib

conv_maxpool = darius_lib.Darius(ifm_height, ifm_width, ifm_depth,
                                 kernel_height, kernel_width, pad, stride,
                                 channels, pool_kernel_height,
                                 pool_kernel_width, pool_stride,
                                 ifm_baseaddr, weights_baseaddr,
                                 ofm_baseaddr)

IP_cmd = conv_maxpool.IP_cmd()

print("Command to CNNDataflow IP: \n" + str(IP_cmd))

All IP arguments are in supported range
Command to CNNDataflow IP: 
b'\x0e\x00\x0e\x00\x03\x00\x03\x00\x01\x00\x00\x00\x0c\x00\x0c\x00\x08\x00\x04\x00\x01\x00\x01\x00\x00\x000\x17 \x06\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x00p\x18\x90\x00\x00\x00\x00\x00\xd0\x17@\x02\x00\x00\x00\x12\x00\x00\x00\x00\x00\x00\x0c\x00\x0c\x00\x02\x00\x02\x00\x06\x00\x06\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
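
The command is an opaque byte string, but its leading fields can be inspected with `struct`. The sketch below unpacks the first 16 bytes of the output above as eight little-endian unsigned 16-bit integers. Treating the command as a sequence of 16-bit little-endian fields is an assumption based on the printed bytes. The field meanings suggested in the comment are guesses from matching the values set in Step 1, not a documented format.

```python
import struct

# First 16 bytes of the command printed above
prefix = b'\x0e\x00\x0e\x00\x03\x00\x03\x00\x01\x00\x00\x00\x0c\x00\x0c\x00'

# Eight little-endian unsigned 16-bit values
fields = struct.unpack('<8H', prefix)
print(fields)
# -> (14, 14, 3, 3, 1, 0, 12, 12)
# Plausibly: IFM height/width, kernel height/width, stride, pad, and a
# 12x12 convolution output size (this interpretation is a guess).
```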


### Step 7: Create IFM volume and weight volume.
Volumes are created in software and populated with random values in a row-major format.

In [7]:
# Random test data in [0, 255); the upper bound of randint is exclusive
ifm_sw = np.random.randint(
    0, 255, ifm_width * ifm_height * ifm_depth, dtype=np.int16)
weights_sw = np.random.randint(
    0, 255, channels * ifm_depth * kernel_height * kernel_width,
    dtype=np.int16)

#### Reshape IFM volume and weights
The volumes are reshaped from row-major format to the IP format, and the data is copied to their respective shared buffers.

In [8]:
conv_maxpool.reshape_and_copy_ifm(ifm_sw, ifm)
conv_maxpool.reshape_and_copy_weights(weights_sw, weights)

### Step 8: Load convolution command and start CNNDataflow IP

In [9]:
# Send convolution command to CNNDataflow IP
cmd_mem = MMIO(cmd_baseaddr, SIZE)
cmd_mem.write(0x0, IP_cmd)

# Load the number of commands and command physical address to offset addresses
cnn.write(NUM_COMMANDS_OFFSET, 1)
cnn.write(CMD_BASEADDR_OFFSET, cmd_baseaddr)

# Start Convolution if CNNDataflow IP is in Idle state
state = cnn.read(0x0)
if state == 4: # Idle state
    print("state: IP IDLE; Starting IP")
    cnn.write(0x0, 1)  # Start IP
else:
    print("state %x: IP BUSY" % state)

state: IP IDLE; Starting IP


#### Check status of the CNNDataflow IP

In [10]:
# Check if Convolution IP is in Done state
state = cnn.read(0x0)
if state == 6: # Done state
    print("state: IP DONE")
else:
    print("state %x: IP BUSY" % state)

state: IP DONE
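
The cell above samples the status register once, so it may report BUSY if the IP has not finished yet. A small polling helper with a timeout makes the check repeatable. The helper below is a sketch: `wait_for_done` is a made-up name, the register reader is passed in as a callable so the code can be tried without hardware, and the value 6 for the done state is taken from the cell above.

```python
import time

DONE_STATE = 6  # value this notebook treats as "done"


def wait_for_done(read_state, timeout_s=1.0, poll_interval_s=0.001):
    """Poll read_state() until it returns DONE_STATE or the timeout expires.

    Returns True if the done state was observed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_state() == DONE_STATE:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)


# Stand-in for the hardware: reports a non-done value (0) twice, then done (6)
fake_states = iter([0, 0, 6])
print(wait_for_done(lambda: next(fake_states)))
```

On the board, the callable would be something like `lambda: cnn.read(0x0)`.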


### Step 9: Read back first few words of OFM

In [11]:
for i in range(0, 15, 4):
    print(hex(ofm[i]))

0x6aeb
0xe03
0x11d4
0x5c9c


### Step 10: Read cycle count and efficiency of the complete run

In [12]:
hw_cycles = cnn.read(CYCLE_COUNT_OFFSET, 4)
efficiency = conv_maxpool.calc_efficiency(hw_cycles)
print("CNNDataflow IP cycles: %d\nEfficiency: %.2f%%" % (hw_cycles, efficiency))

CNNDataflow IP cycles: 44367
Efficiency: 93.47%
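
The implementation of `calc_efficiency()` is not shown in this notebook. One way to arrive at the reported figure is to count the multiply-accumulate (MAC) operations of the convolution, divide by the cycle count, and compare the result with a peak throughput. The peak of 64 MACs per cycle used below is an assumption chosen because it reproduces the reported value; it is not taken from Darius documentation.

```python
# Convolution parameters from Step 1
ifm_height = ifm_width = 14
ifm_depth, channels = 64, 32
kernel_height = kernel_width = 3

# Convolution output size for pad 0, stride 1
ofm_height = ifm_height - kernel_height + 1
ofm_width = ifm_width - kernel_width + 1

# One MAC per kernel element, per input channel, per output element
macs = (ofm_height * ofm_width * channels
        * ifm_depth * kernel_height * kernel_width)

hw_cycles = 44367        # cycle count reported above
ASSUMED_PEAK_MACS = 64   # assumed MACs per cycle, not a documented figure

efficiency = 100.0 * macs / (hw_cycles * ASSUMED_PEAK_MACS)
print("MACs: %d, efficiency: %.2f%%" % (macs, efficiency))
```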


#### Reset Xlnk

In [13]:
mmu.xlnk_reset()
print_kb(mmu)
print("Cleared Memory!")

Available Memory (KB): 129460
Available Buffers: 0
Cleared Memory!
