BBB_cci_gemm

FPGA GEMM IP Overview

The GEMM IP for Intel FPGA is based on a 2-D systolic array architecture. The GEMM core consists of processing elements (PEs) laid out in a two-dimensional array. Each PE computes the dot product of two vectors. The number of elements in these vectors depends on the data type used by the GEMM: 8 elements in FP-32 mode, 16 elements in INT-16 mode, and 32 elements in INT-8 mode. Along with performing the dot product, each PE also propagates the vectors to its neighbors (both above and to the right). In all, there are 160 PEs organized in 10 rows and 16 columns.

Figure 1 shows the various building blocks that are connected to the GEMM IP. The GEMM IP is directly connected to the MPF. The MPF is a basic building block that provides the virtual to physical address translation services for the AFU using a TLB in the hardware. The MPF is connected to the CCI-P interface. The Async FIFO shim is an optional block that can be used between the MPF and CCI-P interface in order to allow the GEMM IP to run at a lower frequency than that of the CCI-P interface.

Figure 1: GEMM IP with CCI-P and MPF

Figure 2 shows the systolic array architecture of the GEMM IP. The elements of matrix A are sent in chunks to the row feeders and elements of matrix B are sent in chunks to the column feeders. All the PEs are arranged in a 2-D array of 10x16, resulting in 160 PEs. Note that the chunk size fed into the PEs for a given matrix multiplication is selected based on control and status registers (CSRs).

The chunk size used for matrix A and matrix B can be different. However, the chunk size for all PEs fed from matrix A is the same, and the chunk size for all PEs fed from matrix B is the same. Because of this, each PE produces an equal sized section of the final result matrix. The specific size of this section (as determined by the CSRs) is equal to the chunk size of A by the chunk size of B. For example, if the chunk size of A is set to 8 and the chunk size of B is set to 10, then each PE would calculate an 8x10 section of the final result matrix.

Note that there are some restrictions on the allowed chunk sizes. The minimum allowed chunk size for matrix A is 2, the minimum allowed chunk size for matrix B is 5, and the maximum allowed chunk size for both matrices is 32. Furthermore, the chunk size of A multiplied by the chunk size of B must be greater than 50. This means that if the chunk size of A is set to 5, the chunk size of B cannot be set to 10 (but could be set to 11).

Lastly, since there are 10 rows of PEs and the minimum chunk size of A is 2, the minimum size for the number of rows in matrix A is 20 (2 times 10 rows). Similarly, since we have 16 columns of PEs and the minimum chunk size of B is 5, the minimum size for the number of columns in matrix B is 80 (5 times 16 columns). However, note that due to the limitations of the chunk size values mentioned before, it is not possible to use the minimum row size of A in conjunction with the minimum column size of B.
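The restrictions above can be collected into a small check. The following Python sketch is illustrative only: the function name `chunk_sizes_valid` and the constant names are assumptions made for this example and are not part of the GEMM software.

```python
# Limits quoted in the text above; all names here are illustrative.
MIN_CHUNK_A, MIN_CHUNK_B, MAX_CHUNK = 2, 5, 32


def chunk_sizes_valid(chunk_a: int, chunk_b: int) -> bool:
    """Return True if the pair satisfies every restriction listed above."""
    return (
        MIN_CHUNK_A <= chunk_a <= MAX_CHUNK
        and MIN_CHUNK_B <= chunk_b <= MAX_CHUNK
        and chunk_a * chunk_b > 50  # strictly greater than 50
    )


print(chunk_sizes_valid(5, 10))  # False: 5 * 10 == 50 is not greater than 50
print(chunk_sizes_valid(5, 11))  # True
print(chunk_sizes_valid(2, 5))   # False: the two minimums cannot be combined

# Smallest chunk size of B that is usable with the minimum chunk size of A (2):
print(min(b for b in range(MIN_CHUNK_B, MAX_CHUNK + 1) if chunk_sizes_valid(2, b)))  # 26
```

The last line shows that when matrix A uses its minimum chunk size of 2, matrix B needs a chunk size of at least 26, which corresponds to at least 416 columns (26 times 16 columns).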

Figure 2: Systolic Array of PEs

Control and Status Registers

The functionality of the GEMM IP is managed through the use of control and status registers (CSRs) that are exposed through the MMIO interface of the CCI-P.

Table 1: GEMM CSR and its description

CSR Name	Offset	Size	CCI-P Mandatory	Description
CSR_AFH_DFH_BASE	0x000	64 bits	Yes	Device feature header
CSR_AFH_ID_L	0x008	64 bits	Yes	Lower 64 bits of the GEMM GUID
CSR_AFH_ID_H	0x010	64 bits	Yes	Upper 64 bits of GEMM GUID
CSR_AFU_DSM_BASE	0x100	64 bits	GEMM Specific	AFU DSM base address
CSR_VERSION	0x110	64 bits	GEMM Specific	Systolic GEMM Version
CSR_CTL	0x118	64 bits	GEMM Specific	Control CSR for controlling operation of the GEMM IP
CSR_CFG	0x120	64 bits	GEMM Specific	Configures the write type and read type used through CCI-P
CSR_SRC_ADDR_A	0x128	64 bits	GEMM Specific	Pointer to the base address of matrix A
CSR_SRC_ADDR_B	0x130	64 bits	GEMM Specific	Pointer to the base address of matrix B
CSR_DST_ADDR_C	0x138	64 bits	GEMM Specific	Pointer to the base address of matrix C
CSR_NUM_BLOCKS	0x140	64 bits	GEMM Specific	Number of blocks in a workload
CSR_NUM_PARTS_A	0x148	64 bits	GEMM Specific	Number of partitions in matrix A
CSR_NUM_PARTS_B	0x150	64 bits	GEMM Specific	Number of partitions in matrix B
CSR_NUM_PARTS_C	0x158	64 bits	GEMM Specific	Number of partitions in matrix C
CSR_NUM_ROWS_X_NUM_BLOCKS	0x160	64 bits	GEMM Specific	Software must write into this CSR the product of the number of rows (10) in the GEMM core and the number of blocks represented by CSR_NUM_BLOCKS
CSR_NUM_COL_X_NUM_BLOCKS	0x168	64 bits	GEMM Specific	Software must write into this CSR the product of number of columns (16) in the GEMM core and number of blocks represented by CSR_NUM_BLOCKS
CSR_NUM_CACHE_LINES_C	0x170	64 bits	GEMM Specific	Software must write into this CSR the number of cache lines that comprise matrix C. This value is used by the HW to check the condition for GEMM computation complete.
CSR_CHUNK_SIZE_A	0x178	64 bits	GEMM Specific	Number of vectors contained within a single chunk of matrix A
CSR_CHUNK_SIZE_B	0x180	64 bits	GEMM Specific	Number of vectors contained within a single chunk of matrix B
CSR_GROUP_SIZE	0x188	64 bits	GEMM Specific	Number of chunks contained within a single group

CSR_AFH_DFH_BASE

Table 2: CSR_AFH_DFH_BASE description

Bit	Attribute	Default	Description
63:60	Read only	0x1	Type: AFU
59:52	RSVD	0x0	Reserved
51:48	Read only	0x0	AFU Minor version
47:41	RSVD	0x0	Reserved
40	Read only	NA	End of list. 1'b0: there is another feature header beyond this one. 1'b1: this is the last feature for this AFU.
39:16	Read only	0x0	Byte offset to the Next device feature Header. For MPF, the byte offset address is 0x1000
15:12	Read only	NA	AFU Major version
11:0	Read only	0x070	CCI-P version

AFU GUID

The AFU GUID is the architectural interface/contract that the AFU makes with the SW. The same GEMM IP supports 3 modes (each with its own bitstream).

GEMM FP-32: 64f6fa35-6025-4e72-ad92-15c3-a431-73a9
GEMM INT-16: 311791dc-97e9-4783-87b7-0d33-b119-0613
GEMM INT-8: da52758f-3f2a-45c1-89de-7762-7064-30ea
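As an illustration of how a GUID maps onto the two 64-bit ID registers, the sketch below splits a GUID string into its upper and lower halves. The helper name `split_afu_id` is hypothetical and exists only for this example.

```python
def split_afu_id(guid: str) -> tuple[int, int]:
    """Split a 128-bit GUID string into (upper 64 bits, lower 64 bits)."""
    value = int(guid.replace("-", ""), 16)
    return value >> 64, value & 0xFFFF_FFFF_FFFF_FFFF


# GEMM FP-32 GUID as listed above.
high, low = split_afu_id("64f6fa35-6025-4e72-ad92-15c3-a431-73a9")
print(f"{high:016X} {low:016X}")  # 64F6FA3560254E72 AD9215C3A43173A9
```

The upper half corresponds to CSR_AFH_ID_H and the lower half to CSR_AFH_ID_L.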

GEMM FP-32

CSR_AFH_ID_L

CSR_AFH_ID_L stores the lower 64 bits of the GEMM FP-32 AFUID.

Table 3: CSR_AFH_ID_L description for GEMM FP-32

Bits	Attribute	Default	Description
63:0	Read only	0	64'hAD92_15C3_A431_73A9

CSR_AFH_ID_H

CSR_AFH_ID_H stores the upper 64 bits of the GEMM FP-32 AFUID.

Table 4: CSR_AFH_ID_H description for GEMM FP-32

Bits	Attribute	Default	Description
63:0	Read only	0	64'h64F6_FA35_6025_4E72

GEMM INT-16

CSR_AFH_ID_L

CSR_AFH_ID_L stores the lower 64 bits of the GEMM INT-16 AFUID.

Table 5: CSR_AFH_ID_L description for INT-16

Bits	Attribute	Default	Description
63:0	Read only	0	64'h87B7_0D33_B119_0613

CSR_AFH_ID_H

CSR_AFH_ID_H stores the upper 64 bits of the GEMM INT-16 AFUID.

Table 6: CSR_AFH_ID_H description for INT-16

Bits	Attribute	Default	Description
63:0	Read only	0	64'h3117_91DC_97E9_4783

GEMM INT-8

CSR_AFH_ID_L

CSR_AFH_ID_L stores the lower 64 bits of the GEMM INT-8 AFUID.

Table 7: CSR_AFH_ID_L description for INT-8

Bits	Attribute	Default	Description
63:0	Read only	0	64'h89DE_7762_7064_30EA

CSR_AFH_ID_H

CSR_AFH_ID_H stores the upper 64 bits of the GEMM INT-8 AFUID.

Table 8: CSR_AFH_ID_H description for INT-8

Bits	Attribute	Default	Description
63:0	Read only	0	64'hDA52_758F_3F2A_45C1

CSR_AFU_DSM_BASE

CSR_AFU_DSM_BASE is a 64-bit register that stores the virtual address of the DSM address space. The GEMM uses the DSM for FPGA to CPU signaling. The GEMM IP always writes to a specific address in the DSM address space and the software should poll on this address to check if a GEMM computation is complete.

The address of the status complete flag is given by:

StatusAddr = CSR_AFU_DSM_BASE + 0x40

StatusAddr is a 32-bit CSR in the DSM address space. The GEMM software should poll bit 0 within this CSR to determine when a GEMM computation is complete.

Table 9: StatusAddr CSR (within DSM) description

Bits	Attribute	Default	Description
StatusAddr[0]	RW	0	0: GEMM compute not complete; 1: GEMM compute complete

The software must reset the StatusAddr[0] before the start of the next execution.
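The polling and reset steps can be sketched as follows. This is a software-only illustration: a `bytearray` stands in for the DSM region, the function names are hypothetical, and the little-endian layout of the 32-bit status word is an assumption (typical for an x86 host) rather than something stated here.

```python
import struct
import time

STATUS_OFFSET = 0x40  # StatusAddr = CSR_AFU_DSM_BASE + 0x40


def gemm_done(dsm) -> bool:
    """Return True when bit 0 of the 32-bit status word is set."""
    (status,) = struct.unpack_from("<I", dsm, STATUS_OFFSET)
    return bool(status & 0x1)


def clear_status(dsm) -> None:
    """Clear bit 0 so the next run starts from 'not complete'."""
    (status,) = struct.unpack_from("<I", dsm, STATUS_OFFSET)
    struct.pack_into("<I", dsm, STATUS_OFFSET, status & 0xFFFF_FFFE)


def wait_for_gemm(dsm, timeout_s: float = 1.0, poll_s: float = 0.001) -> bool:
    """Poll until the flag is set; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while not gemm_done(dsm):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True


# Stand-in for the DSM region; on real hardware this would be the shared buffer.
dsm = bytearray(128)
struct.pack_into("<I", dsm, STATUS_OFFSET, 1)  # pretend the GEMM IP signalled completion
print(wait_for_gemm(dsm))  # True
clear_status(dsm)          # required before the next execution
print(gemm_done(dsm))      # False
```

A timeout is included so that a stalled computation does not hang the caller; the values used here are arbitrary.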

Figure 3: StatusAddr location in DSM Address Space

CSR_VERSION

Software can read the CSR_VERSION register to get the GEMM version.

Table 10: CSR_VERSION description

Bits	Attribute	Description
CSR_VERSION[63:0]	RO	[63:48] – RSVD; [47:32] – Major Revision; [31:16] – Minor Revision; [15:0] – Patch Revision
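A minimal sketch of decoding the fields in Table 10 from a raw 64-bit register value. The function name `decode_version` and the sample value are made up for illustration.

```python
def decode_version(raw: int) -> tuple[int, int, int]:
    """Return (major, minor, patch) from a raw CSR_VERSION value."""
    major = (raw >> 32) & 0xFFFF  # bits [47:32]
    minor = (raw >> 16) & 0xFFFF  # bits [31:16]
    patch = raw & 0xFFFF          # bits [15:0]
    return major, minor, patch


print(decode_version(0x0000_0001_0002_0003))  # (1, 2, 3)
```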

CSR_CTL

CSR_CTL is used to control when the GEMM IP starts computing. Under normal operations, the GEMM IP will complete and stop on its own. In the rare case that a computation must be ended early, the CSR_CTL allows for a way to force the GEMM IP to stop computing.

Table 11: CSR_CTL description

Bits	Attribute	Default	Description
CSR_CTL[1:0]	RW	00	00 – RSVD; 01 – GEMM Start; 10 – GEMM Stop; 11 – RSVD

CSR_CFG

CSR_CFG is used to configure the channels and types of memory requests that occur during computation of the GEMM.

Table 12: CSR_CFG description

Bits	Attribute	Default	Description
CSR_CFG[17:16]	RW	00	Channel Type Select: 00 – VA; 01 – VL0; 10 – VH0; 11 – VH1
CSR_CFG[11:8]	RW	0000	Read Type Select: 0000 – eREQ_RDLINE_I; 0001 – eREQ_RDLINE_S; Others – RSVD
CSR_CFG[3:0]	RW	0000	Write Type Select: 0000 – eREQ_WRLINE_I; 0001 – eREQ_WRLINE_M; Others – RSVD
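The three fields can be packed into a single register value as sketched below. The field positions (channel in bits [17:16], read type in bits [11:8], write type in bits [3:0]) and encodings come from Table 12; the dictionary and function names are hypothetical.

```python
CHANNEL = {"VA": 0b00, "VL0": 0b01, "VH0": 0b10, "VH1": 0b11}
READ_TYPE = {"eREQ_RDLINE_I": 0b0000, "eREQ_RDLINE_S": 0b0001}
WRITE_TYPE = {"eREQ_WRLINE_I": 0b0000, "eREQ_WRLINE_M": 0b0001}


def encode_cfg(channel: str, read_type: str, write_type: str) -> int:
    """Build a CSR_CFG value; unknown names raise KeyError (reserved encodings)."""
    return (
        (CHANNEL[channel] << 16)       # bits [17:16]
        | (READ_TYPE[read_type] << 8)  # bits [11:8]
        | WRITE_TYPE[write_type]       # bits [3:0]
    )


print(hex(encode_cfg("VH0", "eREQ_RDLINE_S", "eREQ_WRLINE_M")))  # 0x20101
```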

CSR_SRC_ADDR_A

The software writes the memory pointer of matrix A into this CSR. The GEMM IP uses this address to calculate the offsets to read matrix A data.

Table 13: CSR_SRC_ADDR_A description

Bits	Attribute	Description
CSR_SRC_ADDR_A[63:0]	RW	[63:0] - Memory pointer to matrix A

CSR_SRC_ADDR_B

The software writes the memory pointer of matrix B into this CSR. The GEMM IP uses this address to calculate the offsets to read the matrix B data.

Table 14: CSR_SRC_ADDR_B description

Bits	Attribute	Description
CSR_SRC_ADDR_B[63:0]	RW	[63:0] - Memory pointer to matrix B

CSR_DST_ADDR_C

The software writes the memory pointer of matrix C into this CSR. The GEMM IP uses this address to calculate the offsets to write the matrix C data.

Table 15: CSR_DST_ADDR_C description

Bits	Attribute	Description
CSR_DST_ADDR_C[63:0]	RW	[63:0] - Memory pointer to matrix C

CSR_NUM_BLOCKS

CSR_NUM_BLOCKS specifies the number of blocks in a workload that are processed across the inner (a.k.a. common) dimension during the matrix multiplication.

Table 16: CSR_NUM_BLOCKS description

Bits	Attribute	Description
CSR_NUM_BLOCKS[63:0]	RW	[63:0] - Number of blocks in a workload

The number of blocks in a workload (i.e. the value that software needs to write into this register) is defined by the following equation:

k / (ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE)

Equation Parameters	Description
k	The number of columns in matrix A (or the number of rows in matrix B)
ELEMENTS_PER_VECTOR	Equals 8 in FP-32; equals 16 in INT-16; equals 32 in INT-8
CSR_GROUP_SIZE	Value from another CSR

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.
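A short sketch of the CSR_NUM_BLOCKS calculation. The function name is hypothetical, the INT-16 vector width of 16 is taken from the overview at the top of this page, and rejecting a k that is not a multiple of the block depth is a choice made for this example (such matrices would need zero padding first).

```python
ELEMENTS_PER_VECTOR = {"FP-32": 8, "INT-16": 16, "INT-8": 32}


def num_blocks(k: int, mode: str, group_size: int) -> int:
    """Value for CSR_NUM_BLOCKS: k / (ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE)."""
    block_depth = ELEMENTS_PER_VECTOR[mode] * group_size
    if k % block_depth != 0:
        raise ValueError(f"k={k} is not a multiple of {block_depth}; pad the matrices first")
    return k // block_depth


print(num_blocks(1024, "FP-32", 16))  # 8
print(num_blocks(1024, "INT-8", 16))  # 2
```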

CSR_NUM_PARTS_A

CSR_NUM_PARTS_A specifies the number of parts into which matrix A is partitioned along its outer dimension (i.e. the rows of matrix A) for processing during the matrix multiplication.

Table 17: CSR_NUM_PARTS_A description

Bits	Attribute	Description
CSR_NUM_PARTS_A[63:0]	RW	[63:0] - Number of partitions in matrix A

The number of partitions in matrix A (i.e. the value that software needs to write into this register) is defined by the following equation:

m / (10 * CSR_CHUNK_SIZE_A)

Equation Parameters	Description
m	The number of rows in matrix A
CSR_CHUNK_SIZE_A	Value from another CSR

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.

CSR_NUM_PARTS_B

CSR_NUM_PARTS_B specifies the number of parts into which matrix B is partitioned along its outer dimension (i.e. the columns of matrix B) for processing during the matrix multiplication.

Table 18: CSR_NUM_PARTS_B description

Bits	Attribute	Description
CSR_NUM_PARTS_B[63:0]	RW	[63:0] – Number of partitions in matrix B

The number of partitions in matrix B (i.e. the value that software needs to write into this register) is defined by the following equation:

n / (16 * CSR_CHUNK_SIZE_B)

Equation Parameters	Description
n	The number of columns in matrix B
CSR_CHUNK_SIZE_B	Value from another CSR

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.

CSR_NUM_PARTS_C

The CSR_NUM_PARTS_C specifies the number of blocks in the matrix C result.

Table 19: CSR_NUM_PARTS_C description

Bits	Attribute	Description
CSR_NUM_PARTS_C[63:0]	RW	[63:0] – Number of partitions in matrix C

The number of blocks in matrix C (i.e. the value that software needs to write into this register) is equal to CSR_NUM_PARTS_A*CSR_NUM_PARTS_B.

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.
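The three partition counts can be derived together, as in this sketch. The function name `num_parts` is hypothetical, and the example assumes m and n are already multiples of the block dimensions.

```python
PE_ROWS, PE_COLS = 10, 16


def num_parts(m: int, n: int, chunk_a: int, chunk_b: int) -> tuple[int, int, int]:
    """Return (CSR_NUM_PARTS_A, CSR_NUM_PARTS_B, CSR_NUM_PARTS_C)."""
    rows_per_block = PE_ROWS * chunk_a
    cols_per_block = PE_COLS * chunk_b
    if m % rows_per_block or n % cols_per_block:
        raise ValueError("m and n must be multiples of the block dimensions")
    parts_a = m // rows_per_block
    parts_b = n // cols_per_block
    return parts_a, parts_b, parts_a * parts_b


print(num_parts(640, 1024, 32, 32))  # (2, 2, 4)
```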

CSR_NUM_ROWS_X_NUM_BLOCKS

This CSR must be written by software, and it should contain the product of the number of rows (10) in the GEMM core and number of blocks represented by CSR_NUM_BLOCKS.

Table 20: CSR_NUM_ROWS_X_NUM_BLOCKS description

Bits	Attribute	Description
CSR_NUM_ROWS_X_NUM_BLOCKS[63:0]	RW	[63:0] - 10 * CSR_NUM_BLOCKS

CSR_NUM_COL_X_NUM_BLOCKS

This CSR must be written by software, and it should contain the product of the number of columns (16) in the GEMM core and number of blocks represented by CSR_NUM_BLOCKS.

Table 21: CSR_NUM_COL_X_NUM_BLOCKS description

Bits	Attribute	Description
CSR_NUM_COL_X_NUM_BLOCKS[63:0]	RW	[63:0] - 16 * CSR_NUM_BLOCKS

CSR_NUM_CACHE_LINES_C

This CSR must be written by software, and it should contain the number of cache lines that must be written to complete the computation of matrix C. This value is used by the HW to check for the condition that the GEMM computation has completed.

Table 22: CSR_NUM_CACHE_LINES_C description

Bits	Attribute	Description
CSR_NUM_CACHE_LINES_C[63:0]	RW	[63:0] - Number of cache line writes needed for matrix C to finish computation

CSR_CHUNK_SIZE_A

The CSR_CHUNK_SIZE_A specifies the number of vectors that are contained in a chunk of matrix A. Because this CSR controls the number of vectors that are sent to the matrix A side of the PEs, it also means that this CSR directly controls the number of rows that are contained within a block of matrix A. In other words, this CSR allows for dynamically selecting the outer dimension of matrix A that is used for the matrix multiplication. The number of elements used for the matrix multiplication in the outer dimension of matrix A is defined by the following equation:

Rows in a block of matrix A = 10 * CSR_CHUNK_SIZE_A

Table 23: CSR_CHUNK_SIZE_A description

Bits	Attribute	Description
CSR_CHUNK_SIZE_A [63:0]	RW	[63:0] - Number of vectors in a chunk of matrix A

It should be noted that the number of vectors in a chunk of matrix A (along with the number of vectors in a chunk of matrix B) is what determines the size of the matrix C section that is calculated within each PE. This specific size that is calculated within each PE is equal to CSR_CHUNK_SIZE_A x CSR_CHUNK_SIZE_B. As an example, if CSR_CHUNK_SIZE_A is set to 10 and CSR_CHUNK_SIZE_B is set to 11, then each PE calculates a 10x11 section of the matrix C result.

It should be noted that there are some restrictions on the values that can be used in CSR_CHUNK_SIZE_A. The minimum allowed value is 2, the maximum allowed value is 32, and the value of CSR_CHUNK_SIZE_A*CSR_CHUNK_SIZE_B must be greater than 50.

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.

CSR_CHUNK_SIZE_B

The CSR_CHUNK_SIZE_B specifies the number of vectors that are contained in a chunk of matrix B. Because this CSR controls the number of vectors that are sent to the matrix B side of the PEs, it also means that this CSR directly controls the number of columns that are contained within a block of matrix B. In other words, this CSR allows for dynamically selecting the outer dimension of matrix B that is used for the matrix multiplication. The number of elements used for the matrix multiplication in the outer dimension of matrix B is defined by the following equation:

Columns in a block of matrix B = 16 * CSR_CHUNK_SIZE_B

Table 24: CSR_CHUNK_SIZE_B description

Bits	Attribute	Description
CSR_CHUNK_SIZE_B [63:0]	RW	[63:0] - Number of vectors in a chunk of matrix B

It should be noted that the number of vectors in a chunk of matrix B (along with the number of vectors in a chunk of matrix A) is what determines the size of the matrix C section that is calculated within each PE. This specific size that is calculated within each PE is equal to CSR_CHUNK_SIZE_A x CSR_CHUNK_SIZE_B. As an example, if CSR_CHUNK_SIZE_A is set to 10 and CSR_CHUNK_SIZE_B is set to 11, then each PE calculates a 10x11 section of the matrix C result.

It should be noted that there are some restrictions on the values that can be used in CSR_CHUNK_SIZE_B. The minimum allowed value is 5, the maximum allowed value is 32, and the value of CSR_CHUNK_SIZE_A*CSR_CHUNK_SIZE_B must be greater than 50.

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.

CSR_GROUP_SIZE

The CSR_GROUP_SIZE specifies the number of chunks that are sent to both the matrix A side and the matrix B side of the PEs before the matrix multiplication calculations move on to the next block. Because this CSR controls the number of chunks (and thus vectors) that are sent to both the matrix A and the matrix B side of the PEs, it also means that this CSR directly controls the number of elements that are contained within the common dimension of a block of matrix A and matrix B. In other words, this CSR allows for dynamically selecting the inner (a.k.a. common) dimension used for the matrix multiplication. The number of elements used for the matrix multiplication in the common dimension is defined by the following equation:

Common dimension elements in a block of matrix A and B = ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE

Table 25: CSR_GROUP_SIZE description

Bits	Attribute	Description
CSR_GROUP_SIZE [63:0]	RW	[63:0] - Number of chunks in a group within matrix A (or matrix B)

It should be noted that there are some restrictions on the values that can be used in CSR_GROUP_SIZE. The value must be an even number, the minimum allowed value is 2, and the maximum allowed value is 16.

Note that the value in this CSR needs to be programmed by the software before the start of GEMM computation.
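To make the restriction concrete, the sketch below enumerates the allowed group sizes and the resulting common-dimension depth of a block. The names are illustrative and not part of the GEMM software.

```python
VALID_GROUP_SIZES = list(range(2, 17, 2))  # even values from 2 to 16


def block_depth(elements_per_vector: int, group_size: int) -> int:
    """Common-dimension elements in one block: ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE."""
    if group_size not in VALID_GROUP_SIZES:
        raise ValueError(f"CSR_GROUP_SIZE must be one of {VALID_GROUP_SIZES}")
    return elements_per_vector * group_size


print(VALID_GROUP_SIZES)
# [2, 4, 6, 8, 10, 12, 14, 16]
print([block_depth(8, g) for g in VALID_GROUP_SIZES])  # FP-32 (8 elements per vector)
# [16, 32, 48, 64, 80, 96, 112, 128]
```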

GEMM Memory Organization

Matrix A

Matrix A of dimension m by k is split into blocks of the following size:

(10 * CSR_CHUNK_SIZE_A) x (ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE)

The best case is to process a matrix A whose m and k dimensions are multiples of “10*CSR_CHUNK_SIZE_A” and “ELEMENTS_PER_VECTOR*CSR_GROUP_SIZE” respectively. This is often possible because CSR_CHUNK_SIZE_A and CSR_GROUP_SIZE can be chosen to best align with the size of matrix A. When the matrix dimensions are not multiples of the block size, the matrix can be zero padded to the nearest m and k that are multiples of the block size.
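The padded dimensions can be computed by rounding each dimension up to the next multiple of the block size. This is a minimal sketch; `round_up` and `padded_a_shape` are hypothetical helper names.

```python
def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def padded_a_shape(m: int, k: int, chunk_a: int, group_size: int,
                   elements_per_vector: int) -> tuple[int, int]:
    """Return (padded m, padded k) for matrix A."""
    return (
        round_up(m, 10 * chunk_a),
        round_up(k, elements_per_vector * group_size),
    )


# FP-32 (8 elements per vector), chunk size 32, group size 16:
print(padded_a_shape(1000, 1000, 32, 16, 8))  # (1280, 1024)
```

In this example a 1000 x 1000 matrix A is padded to 1280 x 1024, i.e. 4 partitions of 320 rows and 8 blocks of 128 elements along the common dimension.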

Figure 4: Partitioning matrix A into blocks

Memory Organization

From the GEMM hardware point of view, everything is a Cache Line (CL) aligned address. Hence, it is important to align the contents of matrix A memory in a CL aligned format as shown in the following figures.

Figure 5: Partition a block of matrix A into Feeders and representing in GEMM format

Figure 6: Partition two blocks of matrix A into Feeders and represent in GEMM format

Figures 7, 8, and 9 show how the matrix A element with index A[i][j] is mapped to a cache line for different values of CSR_CHUNK_SIZE_A. Note that the byte field in the following figures indicates the byte index in that cache line, which is 64 bytes in length. So “Byte: 0” is the 0th byte in the cache line and “Byte: 4” is the 4th byte in the cache line.

Also note that these figures assume each element in A is an FP-32 (4 byte) element. In the case that each element is an INT-8 (1 byte) element, then each A[i][j] element in the figures below actually contains four 1-byte elements instead of one 4-byte element.

Figure 7: The matrix A element mapping in Cache Line (CSR_CHUNK_SIZE_A = 32)

Figure 8: The matrix A element mapping in Cache Line (CSR_CHUNK_SIZE_A = 16)

Figure 9: The matrix A element mapping in Cache Line (CSR_CHUNK_SIZE_A = 5)

Matrix B

Matrix B of dimension k by n is split into blocks of the following size:

(ELEMENTS_PER_VECTOR * CSR_GROUP_SIZE) x (16 * CSR_CHUNK_SIZE_B)

The best case is to process a matrix B whose k and n dimensions are multiples of “ELEMENTS_PER_VECTOR*CSR_GROUP_SIZE” and “16*CSR_CHUNK_SIZE_B” respectively. This is often possible because CSR_CHUNK_SIZE_B and CSR_GROUP_SIZE can be chosen to best align with the size of matrix B. When the matrix dimensions are not multiples of the block size, the matrix can be zero padded to the nearest k and n that are multiples of the block size.
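One way to pick a chunk size that aligns well with matrix B is to try every allowed value and keep the one that needs the least zero padding. The sketch below does this for the n dimension only; the function names are hypothetical, and the search ignores the separate requirement that CSR_CHUNK_SIZE_A*CSR_CHUNK_SIZE_B be greater than 50, which must still be checked against the chosen chunk size of A.

```python
PE_COLS = 16


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def best_chunk_size_b(n: int) -> tuple[int, int]:
    """Return (chunk size, padded n) with the least padding for n columns."""
    candidates = range(5, 33)  # allowed CSR_CHUNK_SIZE_B values: 5 to 32
    return min(
        ((c, round_up(n, PE_COLS * c)) for c in candidates),
        key=lambda pair: pair[1],
    )


print(best_chunk_size_b(1000))  # (7, 1008)
print(best_chunk_size_b(1024))  # (8, 1024)
```

When several chunk sizes give the same padded width, this sketch returns the smallest one; a larger chunk size with the same padding (for example 21 for n = 1000) may be preferable in practice.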

Figure 10: Partitioning matrix B into blocks

Memory Organization

From the GEMM hardware point of view, everything is a Cache Line (CL) aligned address. Hence, it is important to align the contents of matrix B memory in a CL aligned format as shown in the following figure.

Figure 11: Partition a block of matrix B into Feeders and represent in GEMM format

Figures 12, 13, and 14 show how the matrix B element with index B[i][j] is mapped to a cache line for different values of CSR_CHUNK_SIZE_B. Note that the byte field in the following figures indicates the byte index in that cache line, which is 64 bytes in length. So “Byte: 0” is the 0th byte in the cache line and “Byte: 4” is the 4th byte in the cache line.

Also note that these figures assume each element in B is an FP-32 (4 byte) element. In the case that each element is an INT-8 (1 byte) element, then each B[i][j] element in the figures below actually contains four 1-byte elements instead of one 4-byte element.

Figure 12: The matrix B element mapping in Cache Line (CSR_CHUNK_SIZE_B = 32)

Figure 13: The matrix B element mapping in Cache Line (CSR_CHUNK_SIZE_B = 16)

Figure 14: The matrix B element mapping in Cache Line (CSR_CHUNK_SIZE_B = 5)

Matrix C

The resultant matrix C is organized in memory in the following format. Matrix C is split into blocks of the following size:

(10 * CSR_CHUNK_SIZE_A) x (16 * CSR_CHUNK_SIZE_B)

The number of blocks in the matrix C result depends on the values in CSR_NUM_PARTS_A and CSR_NUM_PARTS_B. For example, if CSR_NUM_PARTS_A = 2 and CSR_NUM_PARTS_B = 2, then the total number of blocks in matrix C will be 4, with each block consisting of a “10*CSR_CHUNK_SIZE_A” x “16*CSR_CHUNK_SIZE_B” section of the final resultant matrix. The number of cache lines in the resultant blocks of matrix C can be calculated using the below equation:

Cache lines in a block of matrix C = (CSR_CHUNK_SIZE_A * CSR_CHUNK_SIZE_B * 160 * 4) / 64
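The equation above can be evaluated as follows. The constant names are this example's reading of the equation (160 PEs, 4 bytes per result element, 64-byte cache lines). The second function, which multiplies by the number of C blocks to obtain a total, is an assumption about how CSR_NUM_CACHE_LINES_C would be derived and is not stated explicitly on this page.

```python
CACHE_LINE_BYTES = 64
NUM_PES = 160
BYTES_PER_RESULT = 4  # the "* 4" term in the equation above


def cache_lines_per_c_block(chunk_a: int, chunk_b: int) -> int:
    """Cache lines in one block of matrix C."""
    return (chunk_a * chunk_b * NUM_PES * BYTES_PER_RESULT) // CACHE_LINE_BYTES


def total_cache_lines_c(chunk_a: int, chunk_b: int, parts_a: int, parts_b: int) -> int:
    """Assumed total: matrix C consists of parts_a * parts_b equally sized blocks."""
    return cache_lines_per_c_block(chunk_a, chunk_b) * parts_a * parts_b


print(cache_lines_per_c_block(32, 32))    # 10240
print(cache_lines_per_c_block(10, 20))    # 2000
print(total_cache_lines_c(32, 32, 2, 2))  # 40960
```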

Each PE in the systolic array produces a “CSR_CHUNK_SIZE_A” x “CSR_CHUNK_SIZE_B” block in the final matrix C result. Some cases using various values for CSR_CHUNK_SIZE_A and CSR_CHUNK_SIZE_B are illustrated in the following figures.

Figure 15: Matrix C cache line order (CSR_CHUNK_SIZE_A = 32 and CSR_CHUNK_SIZE_B = 32)

Figure 16: Matrix C cache line order (CSR_CHUNK_SIZE_A = 10 and CSR_CHUNK_SIZE_B = 20)

The “CSR_CHUNK_SIZE_A” x “CSR_CHUNK_SIZE_B” elements from each PE are stored in a linear fashion. When draining the results from the PE, each column PE writes to a fixed location in the cache line. For example, PE[0] writes to the 0th byte in a cache line, PE[1] writes to the 4th byte, and so on. This is illustrated in the following figures.
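The fixed write locations can be sketched as a simple offset calculation. This assumes the pattern given for PE[0] and PE[1] continues linearly across all 16 column PEs, so that 16 results of 4 bytes each fill one 64-byte cache line; the function name is hypothetical.

```python
CACHE_LINE_BYTES = 64
PE_COLS = 16
BYTES_PER_RESULT = CACHE_LINE_BYTES // PE_COLS  # 4 bytes per column PE


def pe_byte_offset(pe_column: int) -> int:
    """Byte offset within a cache line written by the given column PE."""
    if not 0 <= pe_column < PE_COLS:
        raise ValueError("pe_column must be in the range 0..15")
    return pe_column * BYTES_PER_RESULT


print([pe_byte_offset(c) for c in range(PE_COLS)])
# [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60]
```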

Figure 17: Matrix C cache line format (CSR_CHUNK_SIZE_A = 32 and CSR_CHUNK_SIZE_B = 32)

Figure 18: Matrix C cache line format (CSR_CHUNK_SIZE_A = 10 and CSR_CHUNK_SIZE_B = 20)

If matrix A is partitioned into CSR_NUM_PARTS_A blocks and matrix B is partitioned into CSR_NUM_PARTS_B blocks, then the total number of C blocks is the product of CSR_NUM_PARTS_A and CSR_NUM_PARTS_B. When there is more than one block of matrix C, the subsequent matrix C blocks are filled in column-first format. This is illustrated in the following figures.

Figure 19: Matrix C cache line order for multiple blocks (CSR_CHUNK_SIZE_A = 32 and CSR_CHUNK_SIZE_B = 32)

Figure 20: Matrix C cache line order for multiple blocks (CSR_CHUNK_SIZE_A = 10 and CSR_CHUNK_SIZE_B = 20)
