# ARRAY INPUTS - MUL - OPTIMIZATIONS
In this notebook we explain how to use the ```pynq``` framework to test the optimized hardware acceleration of element-by-element matrix multiplication.

In [None]:
import datetime
from pynq import Overlay
from pynq import DefaultIP
from pynq import DefaultHierarchy
from pynq import allocate
from pynq import MMIO
from pynq.pl import *
import pynq.lib.dma
import numpy as np
import time

These constants describe the registers through which the arguments visible to the FPGA are mapped. They correspond to the parameters of the function synthesized with Vivado HLS.
 - ```XMUL_MATRIX_AXILITES_ADDR_X_DATA``` is the address (offset) of the register for argument ```X```;
 - ```XMUL_MATRIX_AXILITES_BITS_X_DATA``` is the width of that register in bits (32).

In [None]:
XMUL_MATRIX_AXILITES_ADDR_A_DATA = 0x10
XMUL_MATRIX_AXILITES_BITS_A_DATA = 32
XMUL_MATRIX_AXILITES_ADDR_B_DATA = 0x18
XMUL_MATRIX_AXILITES_BITS_B_DATA = 32
XMUL_MATRIX_AXILITES_ADDR_C_DATA = 0x20
XMUL_MATRIX_AXILITES_BITS_C_DATA = 32
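
As a rough illustration of how the offset and width constants relate, the sketch below groups them into a small register map and checks that a value fits the declared width before it would be written. The names `REGISTERS` and `check_register_value` are hypothetical helpers introduced only for this example, and the sample address is made up.

```python
# Hypothetical register map mirroring the constants above:
# argument name -> (register offset, register width in bits).
REGISTERS = {
    "a": (0x10, 32),
    "b": (0x18, 32),
    "c": (0x20, 32),
}


def check_register_value(name, value):
    """Return (offset, value) if value fits the register width, else raise ValueError."""
    offset, bits = REGISTERS[name]
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value:#x} does not fit in the {bits}-bit register '{name}'")
    return offset, value


# Made-up physical address, used only to exercise the helper.
offset, value = check_register_value("a", 0x16000000)
print(hex(offset), hex(value))  # prints: 0x10 0x16000000
```

A check of this kind could run before each `ip.write` call, so that an address wider than the register is reported instead of being silently truncated.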

This function initializes the FPGA hardware. It builds an overlay object (```ol```) from the synthesized design, which holds all the information needed to run the IP module, and keeps a reference to the IP itself (```ip```).

In [None]:
def init_hw(filepath):
    global ol, ip
    ol = Overlay(filepath)
    ip = ol.matrix_mul_0

In [None]:
init_hw("/path/to/dot_design_1.bit")
ol?

In this block, the buffers needed later are allocated and initialized. Their shape and data type must match those declared in Vivado HLS. We suggest using ```numpy``` types for this.

In [None]:
DIM = 256

# Buffers shared with the FPGA: two inputs (a, b) and one output (c)
a = allocate(shape=(DIM, DIM), dtype=np.int32, cacheable=True)
b = allocate(shape=(DIM, DIM), dtype=np.int32, cacheable=True)
c = allocate(shape=(DIM, DIM), dtype=np.int32, cacheable=True)

# Fill the inputs with 3 and clear the output
a[:] = np.full((DIM, DIM), 3, dtype=np.int32)
b[:] = np.full((DIM, DIM), 3, dtype=np.int32)
c[:] = np.zeros((DIM, DIM), dtype=np.int32)
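
To have something to compare the hardware output against, a software reference can be computed with plain ```numpy```. This is a minimal sketch that assumes the IP computes the element-by-element product described at the top of the notebook; the names `a_ref`, `b_ref` and `expected` are introduced only for this example.

```python
import numpy as np

DIM = 256

# Software copies of the input values used above (every element is 3).
a_ref = np.full((DIM, DIM), 3, dtype=np.int32)
b_ref = np.full((DIM, DIM), 3, dtype=np.int32)

# Element-by-element product: expected[i, j] = a_ref[i, j] * b_ref[i, j].
expected = a_ref * b_ref

print(expected[0, 0])   # prints: 9
print(expected.nbytes)  # prints: 262144 (256 * 256 elements * 4 bytes)
```

Once the IP has finished, a comparison such as `np.array_equal(c, expected)` would show whether the accelerator produced the same values, under the assumption stated above.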

The ```ip.write(0x00, 4)``` instruction writes the value ```4``` to the control register (```0x00```), putting the FPGA in the ```idle``` state. The register is then read back into ```fpga_state```.

In [None]:
ip.write(0x00, 4)
fpga_state = ip.read(0x00)
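
To make the numeric states easier to read, the sketch below decodes a control-register value into flag names. It assumes the IP uses the default `ap_ctrl_hs` block-level interface that Vivado HLS generates over AXI4-Lite, where bit 0 is start, bit 1 is done, bit 2 is idle and bit 3 is ready. `decode_ctrl` is a hypothetical helper, not part of `pynq`.

```python
# Assumed bit layout of the control register at offset 0x00 (ap_ctrl_hs).
CTRL_FLAGS = [
    (0x01, "start"),
    (0x02, "done"),
    (0x04, "idle"),
    (0x08, "ready"),
]


def decode_ctrl(value):
    """Return the names of the flags that are set in a control-register value."""
    return [name for mask, name in CTRL_FLAGS if value & mask]


print(decode_ctrl(4))  # prints: ['idle']
print(decode_ctrl(6))  # prints: ['done', 'idle']
```

Under this layout, the value ```4``` used in the notebook is the idle flag alone, and ```6``` is done and idle together.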

Now we get the physical addresses of the previously allocated buffers. If the FPGA is in the ```idle``` state (```4```), we write these addresses into the registers of the IP module, so that it can access the arrays during execution.

In [None]:
a_p_ptr = a.physical_address
b_p_ptr = b.physical_address
c_p_ptr = c.physical_address

ip.write(0x00, 4)

if fpga_state == 4:
    ip.write(XMUL_MATRIX_AXILITES_ADDR_A_DATA, a_p_ptr)
    ip.write(XMUL_MATRIX_AXILITES_ADDR_B_DATA, b_p_ptr)
    ip.write(XMUL_MATRIX_AXILITES_ADDR_C_DATA, c_p_ptr)
else:
    # Stop the cell with an explicit error instead of a fake keyboard interrupt
    raise RuntimeError("Can't write values, must be in IDLE state")

With ```ip.write(0x00, 1)``` we write ```1``` to the control register, which starts the execution of the IP module. The FPGA state is then read repeatedly until, at the end of execution, it becomes ```4``` (```idle```) or ```6``` (```done```). After the ```while``` loop, the result can be saved with a simple assignment.

In [None]:
%%timeit

# Start the IP by writing 1 to the control register
ip.write(0x00, 1)
fpga_state = ip.read(0x00)

# Poll until the IP reports done (6) or idle (4), giving up after max_try reads
max_try = 1000000
while fpga_state != 6 and fpga_state != 4:
    fpga_state = ip.read(0x00)
    max_try -= 1
    if max_try == 0:
        ip.write(0x00, 4)
        raise RuntimeError("ERROR: Can't go ahead")

ip.write(0x00, 4)
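
The start-and-poll pattern above can also be wrapped in a function with a bounded number of reads. The sketch below is only an illustration: `FakeIP` is a hypothetical stand-in that imitates the `read`/`write` calls of the IP object so the example runs without hardware, and `run_and_wait` is a made-up helper name.

```python
class FakeIP:
    """Hypothetical stand-in for the IP object: busy for a few reads, then done."""

    def __init__(self, busy_reads=3):
        self.busy_reads = busy_reads
        self.started = False

    def write(self, offset, value):
        # Writing 1 to the control register (0x00) starts the fake IP.
        if offset == 0x00 and value & 0x01:
            self.started = True

    def read(self, offset):
        if not self.started:
            return 4  # idle
        if self.busy_reads > 0:
            self.busy_reads -= 1
            return 1  # still running
        return 6  # done


def run_and_wait(ip, max_try=1000000):
    """Start the IP, then poll until it reports 4 or 6; return (state, reads used)."""
    ip.write(0x00, 1)
    for attempt in range(1, max_try + 1):
        state = ip.read(0x00)
        if state in (4, 6):
            return state, attempt
    raise TimeoutError(f"IP did not finish within {max_try} reads")


print(run_and_wait(FakeIP(busy_reads=3)))  # prints: (6, 4)
```

With the real hardware, the `ip` object obtained from the overlay would be passed instead of `FakeIP`. Returning the number of reads is just a convenience for seeing how long the polling took.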