### Embedded SoC

Term Project

Designing a General-Purpose Parallel Compute Unit

> Team Spaceship (우주선 조), 2014161001, 강승우 (Kang Seungwoo)







## Concept

A single instruction performs 2<sup>n</sup> operations.

Each thread has 32 general-purpose registers, each 32 bits wide.

There is no programmable instruction memory and no flow control of its own.

Only logic and arithmetic operations are supported.

Instructions must be fed from outside.

Figure: one core (instruction feeder) issuing the same instruction to threads 0 through 2<sup>n</sup>−1.

### Goal

# Advanced SIMD operation

- Numerous independent processors execute the same command at the same time
- Fast local memory access

# Platform Independence

- Synthesis level independence (IP Interfacing)
- Application level independence

# Ultra Fast Massive Data Processing

- 5-stage pipelined architecture (nested 2-stage pipelined FPU)
- Intensive RISC instructions
- Accepts a 200 MHz input clock
  - $\rightarrow$ 64 × 50 MHz = 3.2 GFLOPS per core

# System Overview



## Core, Thread, Task

Core

Instruction feeder

A core contains threads.

Threads execute instructions in parallel.

A workload with more operations than the maximum thread count is abstracted as tasks.

Figure: each thread holds tasks 0 to n.

Figure (controller domain): instructions from outside are pushed into a 2K-beat instruction queue (FIFO) that reports INSTR\_FULL when no space is left and INSTR\_EMPTY when the queue is empty (no operation). An instruction decoder and a bubble generator turn each 32-bit instruction into control words, condition bits, and operands/constants for the threads, stalling when required. A two-port, 1K-beat constant memory, which is a ROM from the thread's point of view, is written through GMEM\_ADDR[31:0], GMEM\_WDAT[31:0], and GMEM\_WEN. A decoder driven by the upper local-memory address bits produces Select[63:0]. Registers with no specific clock assigned use ACLK by default.



#### Thread

Pipelined FPU

Nested 1KBeat local memory (DP SRAM)

Conditional Instruction Execution







#### Task

Abstract parallel process





Serial processing on actual execution
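The relation between abstract tasks and physical threads can be sketched as follows. This is only an illustration under an assumed round-robin mapping: the names `passesNeeded`, `slotFor`, and `Slot` are hypothetical, and the slides do not state how tasks are actually assigned to threads.

```cpp
#include <cassert>
#include <cstddef>

// Where a task runs: which physical thread, and in which serial pass.
struct Slot {
    std::size_t thread;
    std::size_t pass;
};

// Number of serial passes needed when numTasks exceeds the thread count.
inline std::size_t passesNeeded(std::size_t numTasks, std::size_t threadCount) {
    assert(threadCount > 0);
    return (numTasks + threadCount - 1) / threadCount;  // ceiling division
}

// Assumed round-robin mapping (not taken from the slides): task i runs on
// thread i % threadCount during pass i / threadCount.
inline Slot slotFor(std::size_t task, std::size_t threadCount) {
    assert(threadCount > 0);
    return Slot{task % threadCount, task / threadCount};
}
```

With 64 threads, for instance, 100 tasks would need two passes under this mapping, and task 70 would run on thread 6 in the second pass.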

### Interface

- Data transaction
  - Global memory write (constant ROM)
  - Local memory access
    - Write
    - Read
- Queue instruction
  - [Conditional] arithmetic operations
    - Floating point
    - Integer
  - Load/store operation

### Thread ISA

- 32-bit RISC instructions
  - A variant of the ARM architecture
- 3 instruction models

Model 1:

| COND  | OPR   | S  | RD    | RA    | IMM  | RB  |
|-------|-------|----|-------|-------|------|-----|
| 31–28 | 27–23 | 22 | 21–17 | 16–12 | 11–5 | 4–0 |

Model 2:

| COND  | OPR   | S  | RD    | RA    | IMM  |
|-------|-------|----|-------|-------|------|
| 31–28 | 27–23 | 22 | 21–17 | 16–12 | 11–0 |

Model 3:

| COND  | OPR   | S  | RD    | IMM  |
|-------|-------|----|-------|------|
| 31–28 | 27–23 | 22 | 21–17 | 16–0 |
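The field positions in the three instruction models can be extracted with plain shifts and masks. The sketch below is a minimal illustration, not the project's decoder: `DecodedInstr`, `bits`, and `decode` are hypothetical names, and which immediate applies to a given opcode is left to the caller because the slides do not list the opcode encoding.

```cpp
#include <cassert>
#include <cstdint>

// Extract bits [hi:lo] of a 32-bit word (hi - lo + 1 must be below 32).
constexpr std::uint32_t bits(std::uint32_t word, unsigned hi, unsigned lo) {
    return (word >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

// All fields of all three models; the opcode decides which ones are meaningful.
struct DecodedInstr {
    std::uint32_t cond;   // [31:28]
    std::uint32_t opr;    // [27:23]
    std::uint32_t s;      // [22]
    std::uint32_t rd;     // [21:17]
    std::uint32_t ra;     // [16:12]  models 1 and 2
    std::uint32_t imm7;   // [11:5]   model 1
    std::uint32_t rb;     // [4:0]    model 1
    std::uint32_t imm12;  // [11:0]   model 2
    std::uint32_t imm17;  // [16:0]   model 3
};

inline DecodedInstr decode(std::uint32_t w) {
    return DecodedInstr{bits(w, 31, 28), bits(w, 27, 23), bits(w, 22, 22),
                        bits(w, 21, 17), bits(w, 16, 12), bits(w, 11, 5),
                        bits(w, 4, 0),   bits(w, 11, 0),  bits(w, 16, 0)};
}
```

Decoding every field up front keeps the sketch independent of the opcode map; a real decoder would select one model per opcode.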

### Thread ISA

#### ALU

- MOV D := M
- MVN D := 'M+1
- ADC D := A+M+C
- SBC D := A+'M+'C
- AND D := A&M
- ORR D := A | M
- XOR D := A^M
- ADI D := A+I
- SBI D := A-I
- MVI D := I
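The ALU definitions above can be written out as a small reference function. This is a sketch under stated assumptions: the prefix `'` is read as bitwise NOT (and as logical NOT for the one-bit carry `C`), `M` is taken to be the second operand, `I` the immediate, and all arithmetic wraps at 32 bits. The names `AluOp` and `alu` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class AluOp { MOV, MVN, ADC, SBC, AND, ORR, XOR, ADI, SBI, MVI };

// a: operand A, m: operand M, imm: immediate I, carry: carry flag C.
inline std::uint32_t alu(AluOp op, std::uint32_t a, std::uint32_t m,
                         std::uint32_t imm, bool carry) {
    const std::uint32_t c = carry ? 1u : 0u;
    switch (op) {
        case AluOp::MOV: return m;                   // D := M
        case AluOp::MVN: return ~m + 1u;             // D := 'M+1 (two's complement)
        case AluOp::ADC: return a + m + c;           // D := A+M+C
        case AluOp::SBC: return a + ~m + (c ^ 1u);   // D := A+'M+'C
        case AluOp::AND: return a & m;               // D := A&M
        case AluOp::ORR: return a | m;               // D := A|M
        case AluOp::XOR: return a ^ m;               // D := A^M
        case AluOp::ADI: return a + imm;             // D := A+I
        case AluOp::SBI: return a - imm;             // D := A-I
        case AluOp::MVI: return imm;                 // D := I
    }
    return 0u;  // unreachable for valid AluOp values
}
```

Under this reading, SBC with the carry clear computes A − M, and with the carry set it computes A − M − 1, so the carry acts as a borrow.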

#### **FPU**

- ITOF
- FTOI
- FMUL
- FDIV
- FADD
- FSUB
- FNEG

#### **GENERAL**

- LDL D := L[A+I]
- LDC D := G[A+I]
- LDCI D := G[I]
- STL L[B+I] := A
- L : Local memory
- G : Const memory
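The four memory instructions can be modeled in a few lines. This sketch makes several assumptions that the slides do not confirm: addresses are treated as word indices rather than byte offsets, both memories hold 1024 32-bit beats, and out-of-range accesses throw instead of wrapping. `ThreadState`, `Memory`, and the lowercase function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBeats = 1024;  // 1K beats per memory
using Memory = std::array<std::uint32_t, kBeats>;

struct ThreadState {
    std::array<std::uint32_t, 32> reg{};  // general-purpose registers
    Memory local{};                       // L: per-thread local memory
    const Memory* constant = nullptr;     // G: shared constant memory (read-only here)
};

// LDL D := L[A+I]
inline void ldl(ThreadState& t, std::size_t d, std::size_t a, std::uint32_t i) {
    t.reg[d] = t.local.at(t.reg[a] + i);
}

// LDC D := G[A+I]
inline void ldc(ThreadState& t, std::size_t d, std::size_t a, std::uint32_t i) {
    assert(t.constant != nullptr);
    t.reg[d] = t.constant->at(t.reg[a] + i);
}

// LDCI D := G[I]
inline void ldci(ThreadState& t, std::size_t d, std::uint32_t i) {
    assert(t.constant != nullptr);
    t.reg[d] = t.constant->at(i);
}

// STL L[B+I] := A
inline void stl(ThreadState& t, std::size_t a, std::size_t b, std::uint32_t i) {
    t.local.at(t.reg[b] + i) = t.reg[a];
}
```

Holding the constant memory through a pointer to `const` mirrors the statement that it is a ROM from the thread's point of view: threads can load from it but have no store instruction for it.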

### Instruction Level Parallelism



- 5-stage pipeline
- Hardware prevents data hazards by inserting bubbles
- CPI increases with every bubble...
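The cost of bubbles can be estimated with a small issue model. The sketch below is an illustration rather than the actual hazard logic: it assumes in-order issue of one instruction per cycle, no forwarding, and a result that becomes readable `hazardWindow` cycles after the slot following its producer. The default of 3 is an assumption, and `Instr`, `PipelineStats`, and `countBubbles` are hypothetical names. Pipeline fill time is ignored.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// One instruction: destination and up to two source registers (-1 = unused).
struct Instr {
    int rd;
    int ra;
    int rb;
};

struct PipelineStats {
    int instructions;
    int bubbles;
    double cpi;  // (instructions + bubbles) / instructions
};

// A register written by an instruction issued at cycle t is assumed readable
// from cycle t + 1 + hazardWindow; an earlier reader waits, which adds bubbles.
inline PipelineStats countBubbles(const std::vector<Instr>& program,
                                  int hazardWindow = 3) {
    assert(hazardWindow >= 0);
    std::array<int, 32> readyAt{};  // earliest issue cycle for a reader of each register
    int cycle = 0;
    int bubbles = 0;
    for (const Instr& in : program) {
        int issue = cycle;
        for (int src : {in.ra, in.rb}) {
            if (src >= 0 && readyAt[static_cast<std::size_t>(src)] > issue) {
                issue = readyAt[static_cast<std::size_t>(src)];
            }
        }
        bubbles += issue - cycle;
        if (in.rd >= 0) {
            readyAt[static_cast<std::size_t>(in.rd)] = issue + 1 + hazardWindow;
        }
        cycle = issue + 1;
    }
    const int n = static_cast<int>(program.size());
    const double cpi = n > 0 ? static_cast<double>(n + bubbles) / n : 0.0;
    return PipelineStats{n, bubbles, cpi};
}
```

Under this model, the sequence `mov r0, r1; add r2, r0, r3; mov r16, r17; add r18, r16, r19` incurs 6 bubbles, while the reordered sequence `mov r0, r1; mov r16, r17; add r2, r0, r3; add r18, r16, r19` incurs only 2, because the independent instruction fills a slot that would otherwise be a bubble.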

#### Instruction Level Parallelism

- Independent tasks run on a single core at the same time.
- Each thread has 32 registers.
- Two internal parallel processes are simulated by assigning 16 registers to each task.

```
mvi r15, #0
mov r0, r1
add r2, r0, r3
...
mvi r15, #16
mov r0, r1
add r2, r0, r3
add r2, r0, r3
...

mvi r15, #0
mvi r31, #16
mov r0, r1
add r2, r0, r3
add r18, r16, r19
...
```
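The register-bank interleaving in the listing above can be generated mechanically on the host. The following sketch assumes a task body that only touches r0 to r15 and emits each instruction twice, the second copy shifted into r16 to r31. `Op` and `interleaveTwoTasks` are hypothetical names, not part of the project's driver.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One instruction of a task body; a register index of -1 means "unused".
struct Op {
    std::string mnemonic;
    int rd;
    int ra;
    int rb;
};

// Emit the body for two tasks at once: task 0 keeps r0..r15, task 1 uses the
// same instructions with every register shifted up by bankSize.
inline std::vector<Op> interleaveTwoTasks(const std::vector<Op>& body,
                                          int bankSize = 16) {
    std::vector<Op> out;
    out.reserve(body.size() * 2);
    auto shift = [bankSize](int r) { return r < 0 ? r : r + bankSize; };
    for (const Op& op : body) {
        assert(op.rd < bankSize && op.ra < bankSize && op.rb < bankSize);
        out.push_back(op);
        out.push_back({op.mnemonic, shift(op.rd), shift(op.ra), shift(op.rb)});
    }
    return out;
}
```

For the body `mov r0, r1; add r2, r0, r3`, this yields `mov r0, r1; mov r16, r17; add r2, r0, r3; add r18, r16, r19`, so each dependent pair is separated by an instruction from the other task.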

This is not supported in hardware. The following C++ code shows an example of this paradigm.

# Code Example

```
void mult(float const* a, float const* b, size_t NumData, float* dst)
{
    CalcDevice pc;
    pc.ClearData();
    // Each task owns three floats: a[i], b[i], and the product.
    pc.SetSpacePerTask(sizeof(float) * 3);

    // Write a[i] at offset 0 of each task space:
    //   | a[0] |      |   | a[1] |      |   | a[2] |      |   |
    pc.WritePerTask(
        /*Target data*/ a,
        /*Number of tasks*/ NumData,
        /*Ofst from task space*/ 0,
        /*Size per write*/ sizeof(float));

    // Write b[i] at offset 4 of each task space:
    //   | a[0] | b[0] |   | a[1] | b[1] |   | a[2] | b[2] |   |
    pc.WritePerTask(
        /*Target data*/ b,
        /*Number of tasks*/ NumData,
        /*Ofst from task space*/ 4,
        /*Size per write*/ sizeof(float));
```

# Code Example

```
pc.ClearTaskQueue();
pc.QueueTaskLoadWord(
    // This operation will execute
    // LDR DST, [r15 + OFST]
    // per task
    /*Target Register*/ r1,
    /*Data Ofst*/ 0);
pc.QueueTaskLoadWord(
    /*Target Register*/ r2,
    /*Data Ofst*/ 4);
pc.QueueTaskFMul(
    /*Target Register*/ r0,
    /*OPR A, B*/ r1, r2);
// Store the product at offset 8 of each task space:
//   | a[0] | b[0] | a[0]*b[0] | a[1] | b[1] | a[1]*b[1] | a[2] | b[2] | a[2]*b[2] |
pc.QueueTaskStoreWord(
    /*Target Register*/ r0,
    /*Data Ofst*/ 8);
```

# Code Example

```
pc.ExecuteTaskQueue();
pc.Flush();
// A kind of return procedure: read each task's result back into dst.
pc.ReadPerTask(
    /*Destination*/ dst,
    /*Number of tasks*/ NumData,
    /*Ofst from task space*/ 8,
    /*Size per read*/ sizeof(float));
}
```

### Milestone

Design architecture & ISA

Implement pipelined FPU

Implement GPPCU device

Integrate the overall system (as a memory-mapped device)

Develop the application software (device driver, application)

# Progress On Week 5, May 2019

#### Concept

- Week 3, May 2019: Thread concept
- Week 3, May 2019: Core concept
- Week 3, May 2019: Device concept
- Week 4, May 2019: Interface concept
- Week 4, May 2019: User-level usage concept



#### Layout

- Week 4, May 2019: Core diagram
- Week 5, May 2019: Thread diagram
- Week 5, May 2019: System diagram
- Week 1, June 2019: Thread ISA



#### **Implementation**

\*In progress\* Pipelined FPU

Processor control word design

Pipeline design (minimal data-hazard prevention)

