### 21CSS201T COMPUTER ORGANIZATION AND ARCHITECTURE

**UNIT-5** 

#### **Contents**

- Parallelism: Need, types, applications and challenges
- Architecture of Parallel Systems-Flynn's classification
- ARM Processor: The thumb instruction set
- Processor and CPU cores, Instruction Encoding format
- Memory load and Store instruction
- Basics of I/O operations.
- Case study: ARM 5 and ARM 7 Architecture

# Parallelism: Need, types, applications and challenges

### Parallelism

- Executing two or more operations at the same time is known as parallelism.
- Parallel processing is a method to improve computer system performance by executing two or more instructions simultaneously
- A parallel computer is a set of processors that are able to work cooperatively to solve a computational problem.
- Two or more ALUs in CPU can work concurrently to increase throughput
- The system may have two or more processors operating concurrently

# Goals of parallelism

- To increase the computational speed (ie) to reduce the amount of time that you need to wait for a problem to be solved
- To increase throughput (ie) the amount of processing that can be accomplished during a given interval of time
- To improve the performance of the computer for a given clock speed
- To solve bigger problems that might not fit in the limited memory of a single CPU

# Applications of Parallelism

- Numeric weather prediction
- Socio economics
- Finite element analysis
- Artificial intelligence and automation
- Genetic engineering
- Weapon research and defence
- Medical Applications
- Remote sensing applications

### **Applications of Parallelism**



# Types of parallelism

- 1. Hardware Parallelism
- 2. Software Parallelism

#### Hardware Parallelism :

The main objective of hardware parallelism is to increase the processing speed. Based on the hardware architecture, we can divide hardware parallelism into two types: Processor parallelism and memory parallelism.

#### Processor parallelism

Processor parallelism means that the computer architecture has multiple nodes, multiple CPUs or multiple sockets, multiple cores, and multiple threads.

• **Memory parallelism** means shared memory, distributed memory, hybrid distributed shared memory, multilevel pipelines, etc. Sometimes, it is also called a parallel random access machine (PRAM). "It is an abstract model for parallel computation which assumes that all the processors operate synchronously under a single clock and are able to randomly access a large shared memory. In particular, a processor can execute an arithmetic, logic, or memory access operation within a single clock cycle". This is what we call using overlapping or pipelining instructions to achieve parallelism.

### Hardware Parallelism

- One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle.
- If a processor issues k instructions per machine cycle, then it is called a **k-issue processor**.
- In a modern processor, two or more instructions can be issued per machine cycle.
- A conventional processor takes one or more machine cycles to issue a single instruction. These types of processors are called one-issue machines, with a single instruction pipeline in the processor.
- A multiprocessor system which built n k-issue processors should be able to handle a maximum of nk threads of instructions simultaneously

## Software Parallelism

- It is defined by the control and data dependence of programs.
- The degree of parallelism is revealed in the program flow graph.
- Software parallelism is a function of algorithm, programming style, and compiler optimization.
- The program flow graph displays the patterns of simultaneously executable operations.
- Parallelism in a program varies during the execution period.
- It limits the sustained performance of the processor.

#### Example Detection of parallelism in a program

Consider the simple case in which each process is a single HLL statement. We want to detect the parallelism embedded in the following instructions  $P_1, P_2, P_3, P_4$ , and  $P_5$ .

$$P_1: C = D \times E$$
 $P_2: M = G + C$ 
 $P_3: A = B + C$ 
 $P_4: C = L + M$ 
 $P_5: F = G \div E$ 

Assume that each statement requires one step to execute.

No pipelining is considered here.

The dependence graph is



A dependence graph showing both data dependence (solid arrows) and resource dependence (dashed arrows)

#### **Sequential Execution:**

 $P_1: C = D \times E$ 

 $P_2: M = G + C$ 

 $P_3: A = B+C$ 

 $P_{4}: C = L + M$ 

 $P_5: F = G \div E$ 

Sequential execution in five steps, assuming one step per statement (no pipelining)



#### **Parallel Execution:**

$$P_1: C = D \times E$$
 $P_2: M = G + C$ 
 $P_3: A = B + C$ 
 $P_4: C = L + M$ 
 $P_5: F = G \div E$ 

- Parallel execution in THREE steps, assuming TWO adders are available per step.
- If TWO adders are available, the parallel execution requires only THREE steps.
- Only 5 pairs can execute in parallel, if there is no resource conflicts.
- In this example, as shown in the fig, only P2 | P3 | P5 is possible.



#### Mismatch between S/W & H/W Parallelism:

### Example:

$$A = L_1*L_2 + L_3*L_4$$

$$B = L_1 L_2 - L_3 L_4$$

#### **Software Parallelism:**

There are 8 instructions;

FOUR Load instructions (L1, L2, L3 & L4).

TWO Multiply instructions (X1 & X2).

ONE Add instruction (+)

ONE Subtract instruction (-)

 The parallelism varies from 4 to 2 in three cycles.



Average S/W Parallelism = 
$$\frac{8 \text{ cycles}}{3 \text{ cycles}} = \frac{8}{3} = 2.67$$

#### **Hardware Parallelism:**

#### Parallel Execution:

- Using TWO-issue processor:
- The processor can execute one memory access (Load or Store) and one arithmetic operation (multiply, add, subtract) simultaneously.
- The program must execute in 7 cycles.
- The h/w parallelism average is 8/7=1.14.
- It is clear from this example the mismatch between the s/w and h/w parallelism.



### **Example:**

- A h/w platform of a Dual-Processor system, single issue processors are used to execute the same program.
- Six processor cycles are needed to execute the 12 instructions by two processors.
- S1 & S2 are two inserted store operations.
- L5 & L6 are two inserted load operations.
- The added instructions are needed for inter-processor communication through the shared memory.



# Software Parallelism - types

#### Parallelism in Software

- Instruction level parallelism
- Task-level parallelism
- Data parallelism
- Transaction level parallelism

# Instruction level parallelism

•Instruction level Parallelism (ILP) is a measure of how many operations can be performed in parallel at the same time in a computer.

 Parallel instructions are set of instructions that do not depend on each other to be executed.

• ILP allows the compiler and processor to overlap the execution of multiple instructions or even to change the order in which instructions are executed.

# Eg. Instruction level parallelism

Consider the following example

- 1. x = a + b
- 2. y=c-d
- 3. z=x \* y

Operation 3 depends on the results of 1 & 2

So 'Z' cannot be calculated until X & Y are calculated

But 1 & 2 do not depend on any other. So they can be computed simultaneously.

• If we assume that each operation can be completed in one unit of time then these 3 operations can be completed in 2 units of time.

• ILP factor is 3/2=1.5 which is greater than without ILP.

• A superscalar CPU architecture implements ILP inside a single processor which allows faster CPU throughput at the same clock rate.

# Data-level parallelism (DLP)

• Data parallelism is parallelization across multiple processors in parallel computing environments.

• It focuses on distributing the data across different nodes, which operate on the data in parallel.

 Instructions from a single stream operate concurrently on several data

# DLP - example

- Let us assume we want to sum all the elements of the given array of size n and the time for a single addition operation is Ta time units.
- In the case of sequential execution, the time taken by the process will be n\*Ta time unit
- if we execute this job as a data parallel job on 4 processors the time taken would reduce to (n/4)\*Ta + merging overhead time units.

# DLP in Adding elements of array



# DLP in matrix multiplication



• A[m x n] dot B [n x k] can be finished in O(n) instead of O(m\*n\*k) when executed in parallel using m\*k processors.

• The locality of data references plays an important part in evaluating the performance of a data parallel programming model.

• Locality of data depends on the memory accesses performed by the program as well as the size of the cache.

# Flynn's Classification

- This taxonomy distinguishes multi-processor computer architectures according to the two independent dimensions of Instruction stream and Data stream.
- An instruction stream is sequence of instructions executed by machine.
- A data stream is a sequence of data including input, partial or temporary results used by instruction stream.
- Each of these dimensions can have only one of two possible states: Single or Multiple.
- Flynn's classification depends on the distinction between the performance of control unit and the data processing unit rather than its operational and structural interconnections.

# Flynn's Classification

Four category of Flynn classification

|                                 | Single STREAM Multiple                      |                                               |
|---------------------------------|---------------------------------------------|-----------------------------------------------|
| Single                          | Single Instruction<br>Single Data<br>SISD   | Single Instruction<br>Multiple Data<br>SIMD   |
| INSTRUCTION<br>STREAM  Multiple | Multiple Instruction<br>Single Data<br>MISD | Multiple Instruction<br>Multiple Data<br>MIMD |

### SISD

- They are also called scalar processor i.e., one instruction at a time and each instruction have only one set of operands.
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
- Single data: only one data stream is being used as input during any one clock cycle.
- Deterministic execution.
- Instructions are execut Control Instruction Processor sequentially.

 SISD computer having one control unit, one processor unit and single memory unit.

Data

Memory

load A
load B C = A + Bstore C A = B \* 2store A

### SIMD

- A type of parallel computer.
- Single instruction: All processing units execute the same instruction issued by the control unit at any given clock cycle.
- Multiple data: Each processing unit can operate on a different data element as shown if figure below the processor are connected to shared memory or interconnection network providing multiple data to processing unit

 single instruction is executed by different processing unit on



Control

### **MISD**

- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction.
- A single data stream is forwarded to different processing unit whic' connected to different unit and execute ins given to it by control which it is attached.

 same data flow through a linear array of processors executing different instruction streams





### **MIMD**

- Multiple Instruction: every processor may be executing a different instruction stream.
- Multiple Data: every processor may be working with a different data stream.
- Execution can be synchronous asynchronous, determinist or nondeterministic

 Different processor each processing different task.





- ARM Advanced Risc Machine
- T The Thumb 16 bit instruction set.
- D On chip Debug support.
- M Enhanced Multiplier
- I Embedded ICE (in-circuit Emulator) hardware to give break point and watch point support.

# **ARM Features**

- RISC
- 32 bit general purpose processor
- High performance , low power consumption and small size
- Large , regular Register File
- load/store architecture
- Pipelining
- Uniform and fixed-length(32 bit) instruction-(ARM)
- 3-address instruction
- Simple addressing modes

#### **ARM Features**

- Conditional execution of the instructions
- Control over both ALU and Shifter in every data processing instruction
- Multiple load/store register instructions
- Ability to perform 1clk cycle general shift & ALU operation in 1 instruction
- Coprocessor instruction interfacing
- THUMB architecture-(dense 16-bit compressed instruction set)

# THUMB Instruction Set (T variant)

- re-encoded subset of ARM instruction
- Half the size of ARM instructions(16 bit)
- Greater code density
- On execution 16 bit thumb transparently decompressed to full 32 bit ARM without loss of performance
- Has all the advantages of 32 bit core
- Low performance in time-critical code
- Doesn't include some instruction needed for exception handling

#### Thumb instruction set (T variant)

- 40% more instructions than ARM code
- 30% less external memory power than ARM code
- With 32 bit memory
  - -ARM code 40% faster than Thumb code
- With 16 bit memory
  - -Thumb code 45% faster than Arm code
- For best performance
  - -use 32 bit memory and ARM code
- For best cost and power efficiency
  - -use 16 bit memory and thumb code
- In typical embedded system
  - -Use ARM code in 32 bit on-chip memory for small speed-critical routines
  - -Use Thumb code in 16 bit off-chip memory for large non-critical routines

#### **ARM Core dataflow model**

#### **Introduction**

- In Fig 1 shows, an ARM core as functional units connected by data buses.
- □ Data enters the processor core through the Data bus. The data may be an instruction to execute or a data item.
- ☐ Figure 1 shows a Von Neumann implementation of the ARM data items and instructions share the same bus. In contrast, Harvard implementations of the ARM use two different buses.



- The instruction decoder translates instructions before they are executed. Each instruction executed belongs to a particular instruction set.
- The ARM processor, like all RISC processors, uses a load-store architecture. This
  means it has two instruction types for transferring data in and out of the
  processor: load instructions copy data from memory to registers in the core, and
  conversely the store instructions copy data from registers to memory. There are
  no data processing instructions that directly manipulate data in memory. Thus,
  data processing is carried out solely in registers.
- Data items are placed in the register file—a storage bank made up of 32-bit registers. Since the ARM core is a 32-bit processor, most instructions treat the registers as holding signed or unsigned 32-bit values.
- The sign extend hardware converts signed 8-bit and 16-bit numbers to 32-bit values as they are read from memory and placed in a register.
- ARM instructions typically have two source registers, Rn and Rm, and a single result or destination register, Rd. Source operands are read from the register file using the internal buses A and B, respectively.

- The ALU (arithmetic logic unit) or MAC (multiply-accumulate unit) takes the
  register values Rn and Rm from the A and B buses and computes a result. Data
  processing instructions write the result in Rd directly to the register file. Load and
  store instructions use the ALU to generate an address to be held in the address
  register and broadcast on the Address bus.
- One important feature of the ARM is that register Rm alternatively can be
  preprocessed in the barrel shifter before it enters the ALU. Together the barrel
  shifter and ALU can calculate a wide range of expressions and addresses. After
  passing through the functional units, the result in Rd is written back to the register
  file using the Result bus.
- For load and store instructions the incremented updates the address register before the core reads or writes the next register value from or to the next sequential memory location.
- The processor continues executing instructions until an exception or interrupt changes the normal execution flow.

#### **ARM Registers**

- General-purpose registers hold either data or an address.
- Fig 2 shows the active registers available in user mode—a protected mode normally used when executing applications.
- The ARM processor has three registers assigned to a particular task or special function: r13, r14, and r15. They are frequently given different labels to differentiate them from the other registers.
- In Fig 2, the shaded registers identify the assigned special-purpose registers:
- Register r13 is traditionally used as the stack pointer (sp) and stores the head of the stack in the current processor mode.
- Register r14 is called the link register (/r) and is where the core puts the return address whenever it calls a subroutine.
- Register r15 is the program counter (pc) and contains the address of the next instruction to be fetched by the processor.



Fig 2: ARM Register in user mode.

- The processor can operate in seven different modes. All the registers shown are 32 bits in size.
- There are up to 18 active registers: 16 data registers and 2 processor status registers. The data registers are visible to the programmer as r0 to r15.
- Depending upon the context, registers r13 and r14 can also be used as general-purpose registers, which can be particularly useful since these registers are banked during a processor mode change.
- In ARM state the registers r0 to r13 are orthogonal—any instruction that you can apply to r0 you can equally well apply to any of the other registers. However, there are instructions that treat r14 and r15 in a special way.
- In addition to the 16 data registers, there are two program status registers: cpsr and spsr (the current and saved program status registers, respectively).

#### **Processor Modes**

- First 5 bits are signifies as mode selection
- The processor mode determines which registers are active and the access rights to the *cpsr* register itself.
- Each processor mode is either privileged or nonprivileged: A privileged mode allows full read-write access to the cpsr. Conversely, a nonprivileged mode only allows read access to the control field in the cpsr but still allows read-write access to the condition flags.
- There are seven processor modes in total: six privileged modes (abort, fast interrupt request, interrupt request, supervisor, system, and undefined) and one nonprivileged mode (user).

| <ul> <li>Abort modes:</li> <li>The processor enters abort mode when there is a failed attempt to access memory.</li> </ul>                                              |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>Fast interrupt request and interrupt request modes:</li> <li>These two modes correspond to the two interrupt levels available on the ARM processor.</li> </ul> |
| <ul> <li>Supervisor mode</li> <li>It is the mode that the processor is in after reset and is generally the mode that an operating system kernel operates in.</li> </ul> |
| <ul> <li>System mode</li> <li>It is a special version of user mode that allows full read-write access to the cpsr.</li> </ul>                                           |
| <ul> <li>Undefined mode</li> <li>is used when the processor encounters an instruction that is undefined or not supported by the implementation.</li> </ul>              |
| <ul> <li>□ User mode</li> <li>is used for programs and applications.</li> </ul>                                                                                         |

# Single-core computer



# Single-core CPU chip



# Multi-core architectures

 Replicate multiple processor cores on a single die.

Core 1 Core 2 Core 3 Core 4



# Multi-core CPU chip

- The cores fit on a single processor socket
- Also called CMP (Chip Multi-Processor)



# The cores run in parallel

thread 1 thread 2 thread 3 thread 4



#### Within each core, threads are time-sliced (just like o



# Instruction Encoding

- Remember that in a stored program computer, instructions are stored in memory (just like data)
- Each instruction is fetched (according to the address specified in the PC), decoded, and executed by the CPU
- The ISA defines the format of an instruction (syntax) and its meaning (semantics)
- An ISA will define a number of different instruction formats.
- Each format has different fields
- The OPCODE field says what the instruction does (e.g. ADD)
- The OPERAND field(s) say where to find inputs and outputs of the instruction.

# MIPS Instruction Encoding

The nice thing about MIPS (and other RISC machines) is that it has very few instruction formats (basically just 3)

- All instructions are the same size (32 bits = 1 word)
- The formats are consistent with each other (i.e. the OPCODE field is always in the same place, etc.)
- The three formats:
- I-type (immediate)
- 2. R-type (register)
- 3. J-type (jump)

# I-type (immediate)

An immediate instruction has the form:

XXXI rt, rs, immed

- Recall that we have 32 registers, so we need ??? bits each to specify the rt and rs registers
- We allow 6 bits for the opcode (this implies a maximum of ??? opcodes, but there are actually more, see later)
- This leaves 16 bits for the immediate field

| 31 | 25  | )  | 20 |   | 15 |       |   |
|----|-----|----|----|---|----|-------|---|
|    | OPC |    | rs |   | rt | immed |   |
|    | 26  | 21 | 1  | 1 | 6  |       | 0 |

# I-type Example

Example:

ADDI \$a0, \$12, 33

# a0 <- r12 + 33

The ADDI opcode is 8, register a0 is register # 4

31

|   | 25 | 20 | 15 |    |
|---|----|----|----|----|
|   | 8  | 12 | 4  | 33 |
| _ | 26 | 21 | 16 |    |

#### **Load-Store Formats**

- A memory address is 32 bits, so it cannot be directly encoded in an instruction
- Recall the use of a base register + offset (16-bits) in the load-store instructions
- Thus, we need an OPCODE, a destination/source register (destination for load, source for store), a base register, and an offset
- This sounds very similar to the I-type format... example: LW \$14, 8(\$sp) # r14 is loaded from stack+8
- The LW opcode is 35 (0x23)4

| 31<br>35 | 25<br>29 | 14 | <del></del> |   |
|----------|----------|----|-------------|---|
|          | 26       | 21 | 16          | 0 |

# R-type (register) format

General form:

XXX rd, rt, rs

- Arithmetic-logical and comparison instructions require the encoding of 3 registers, the rest can be used to specify the OPCODE.
- To keep the format as regular as possible, the OPCODE has a primary "opcode" and a "function" field.
- We also need 5 bits for the shift-amount, in case of SHIFT instructions.
- The 16 bits used for the immediate field in the I-type instruction are split into 5 bits for rd, 5 bits for shift-amount, and 6 bits for function (the other fields are the same).

| 30PC | 255 | rt20 | rd 15 | sht 10func | t 5 |   |
|------|-----|------|-------|------------|-----|---|
|      | 26  | 21   | 16    | 11         | 6   | 0 |

# R-type Example

• SUB \$7, \$8, \$9

• The opcode for all R-type instructions is zero, the function code for SUB is 34, the shift amount is zero

| • | 31 | 25 | 20 | 15 | 10 | 5  |   |
|---|----|----|----|----|----|----|---|
|   | 0  | 8  | 9  | 7  | 0  | 34 |   |
|   |    | 26 | 21 | 16 | 11 | 6  | 0 |

# J-type (Jump) Format

 For a jump, we only need to specify the opcode, and we can use the other bits for an address:



- We only have 26 bits for the address, but MIPS addresses are 32 bits long...
- Because the address must reference an instruction, which is a word address, we can shift the address left by 2 bits (giving us 28 bits). We get the other 4 bits by combining with the 4 high-order bits of the PC.

# **Branch Addressing**

There are 2 kinds of branches:

- 1.EQ/NEQ family (compares 2 regs for (in)equality), example: BEQ \$14, \$8, 1000
  - 2. Compare-to-zero family (compares 1 reg to zero), example:

BGEZ \$14, 1000

- Both "families" require OPCODE, rs register, and offset
- (1.) requires an additional register (rt)
- (2.) requires some encoding for (>=, <=, )



0

or code (for >, <, etc)

# Branch example

BEQ \$14, \$8, 1000

# PC := PC+1000 if r14==r8

BGEZ \$14, 20

# PC := PC+20 if r14 >= 0

• The opcode for BEQ is 4; for BGEZ is 1, the code for >= is 1

| 31 4 | 1425 | 8 20 | 2505 |
|------|------|------|------|
|------|------|------|------|



26 21 16

0

#### Assembly Language Version

Recall our running example:

```
.data
                         # begin data segment
       .space 400
                         # allocate 400 bytes
array:
                         # begin code segment
        .text
        .globl main
                         # entry point must be global
main:
        move $t0, $0 # $t0 is used as counter
        la $t1, array
                        # $t1 is pointer into array
        bge $t0, 100, exit# more than 99 iterations?
start:
        sw $t0, 0($t1) # store zero into array
        addi $t0, $t0, 1 # increment counter
        addi $t1, $t1, 4 # increment pointer into array
                        # goto top of loop
            start
exit:
            $ra
                        # return to caller of main ...
```

#### Machine Language Version

```
Encoded: Machine Ins: Source Ins:

0x00004021 addu $8, $0, $0 ; 9: move$t0, $0
0x3c091001 lui $9, 4097 ; 10: la$t1, array
0x29010064 slti $1, $8, 100 ; 11: bge$t0, 100, exit
0x10200005 beq $1, $0, 20
0xad280000 sw $8, 0($9) ; 12: sw$t0, 0($t1)
0x21080001 addi $8, $8, 1 ; 13: addi$t0, $t0, 1
0x21290004 addi $9, $9, 4 ; 14: addi$t1, $t1, 4
0x0810000b j 0x0040002c ; 15: jstart
0x03e00008 jr $31 ; 16: j$ra
```

# Memory Load and Store Operation

#### Overview

#### **ARM Load/Store Instructions**

- The ARM is a Load/Store Architecture:
  - Only load and store instructions can access memory
- Does not support memory to memory data processing operations.
- Must move data values into registers before using them.

# Types of instructions

#### **ARM Load/Store Instructions**

- ARM has three sets of instructions which interact with main memory. These are:
- Single register data transfer (LDR/STR)
- Block data transfer (LDM/STM)
- Single Data Swap (SWP)

```
value at [address] found in Rb
is loaded into register Ra

LDR Ra, [Rb]

STR Ra, [Rb]

value found in register Ra
is stored to [address] found in Rb
```

#### **Basic Load and Store Instruction**

The basic load and store instructions are:

LDR STR Word

LDRB STRB Byte

LDRH STRH Halfword

LDRSB Signed byte load

LDRSH Signed halfword load

|      | Load-Store Instructions        |             |  |  |  |  |
|------|--------------------------------|-------------|--|--|--|--|
| LDR  | load word into a register      | Rd ← mem32  |  |  |  |  |
| STR  | save word from a register      | mem32 ← Rd  |  |  |  |  |
| LDRB | load byte into a register      | Rd <- mem8  |  |  |  |  |
| STRB | save byte from a register      | mem8 ← Rd   |  |  |  |  |
| LDRH | load half word into a register | Rd <- mem16 |  |  |  |  |
| STRH | save half word into a register | Rd -> mem16 |  |  |  |  |

### Syntax and Example

#### **ARM Load/Store Instructions**

- Memory system must support all access sizes
- Syntax:
  - LDR{<cond>}{<size>} Rd, <address>
  - STR{<cond>}{<size>} Rd, <address>

#### e.g.

- -LDR R0, [R1]
- -STR R0,[R1]
- LDREQB R0, [R1]

#### **Load Operation**

#### **Data Transfer: Memory to Register (load)**

- To transfer a word of data, we need to specify two things:
- Register: r0-r15
- Memory address: more difficult
  - -Think of memory as a single one-dimensional array, so we can address it simply by supplying a pointer to a memory address.
  - -There are times when we will want to offset from this pointer.

# Case study: ARM 5 and ARM 7 Architecture

#### Data Sizes and Instruction Sets

- The ARM is a 32-bit architecture.
- When used in relation to the ARM:
  - Byte means 8 bits
  - Halfword means 16 bits (two bytes)
  - Word means 32 bits (four bytes)
- Most ARM's implement two instruction sets
  - 32-bit ARM Instruction Set
  - 16-bit Thumb Instruction Set
- Jazelle cores can also execute Java bytecode

#### **Processor Modes**

- The ARM has seven basic operating modes:
  - User : unprivileged mode under which most tasks run
  - FIQ: entered when a high priority (fast) interrupt is raised
  - IRQ : entered when a low priority (normal) interrupt is raised
  - Supervisor : entered on reset and when a Software Interrupt instruction is executed
  - Abort : used to handle memory access violations
  - Undef: used to handle undefined instructions
  - System : privileged mode using the same registers as user mode

# The ARM Register Set

#### **Current Visible Registers**



# Register Organization Summary



Note: System mode uses the User mode register set

# The Registers

- ARM has 37 registers all of which are 32-bits long.
  - 1 dedicated program counter
  - 1 dedicated current program status register
  - 5 dedicated saved program status registers
  - 30 general purpose registers
- The current processor mode governs which of several banks is accessible.
   Each mode can access
  - a particular set of r0-r12 registers
  - a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
  - the program counter, r15 (pc)
  - the current program status register, cpsr

#### Privileged modes (except System) can also access

a particular spsr (saved program status register)

# Program Status Registers



- Condition code flags
  - N = Negative result from ALU
  - Z = Zero result from ALU
  - C = ALU operation Carried out
  - V = ALU operation o Verflowed
- Sticky Overflow flag Q flag
  - Architecture 5TE/J only
  - · Indicates if saturation has occurred

- Interrupt Disable bits.
  - I = 1: Disables the IRQ.
  - F = 1: Disables the FIQ.
- T Bit
  - Architecture xT only
  - T = 0: Processor in ARM state
  - T = 1: Processor in Thumb state
- Mode bits
  - Specify the processor mode

- J bit
  - Architecture 5TEJ only
  - J = 1: Processor in Jazelle state

## Program Counter (r15)

#### When the processor is executing in ARM state:

- All instructions are 32 bits wide
- All instructions must be word aligned
- Therefore the **pc** value is stored in bits [31:2] with bits [1:0] undefined (as instruction cannot be halfword or byte aligned).

#### When the processor is executing in Thumb state:

- All instructions are 16 bits wide
- All instructions must be halfword aligned
- Therefore the **pc** value is stored in bits [31:1] with bit [0] undefined (as instruction cannot be byte aligned).

#### When the processor is executing in Jazelle state:

- All instructions are 8 bits wide
- Processor performs a word access to read 4 instructions at once

# Development of the ARM Architecture



# Conditional Execution and Flags

 ARM instructions can be made to execute conditionally by postfixing them with the appropriate condition code field.

 This improves code density and performance by reducing the number of forward branch instructions......

```
CMP r3,#0

BEQ skip

ADD r0,r1,r2

skip
```

• By default, data processing instructions do not affect the condition code flags but the flags can be decrement r1 and set flags by using "S". CMP does not need "S".

if Z flag clear then branch

```
loop
...
SUBS r1,r1,#1
BNE loop
```

### **Condition Codes**

 The possible condition codes are listed below: Note AL is the default and does not need to be specified

| Suffix | Description             | Flags tested |
|--------|-------------------------|--------------|
| EQ     | Equal                   | Z=1          |
| NE     | Not equal               | Z=0          |
| CS/HS  | Unsigned higher or same | C=1          |
| CC/LO  | Unsigned lower          | C=0          |
| MI     | Minus                   | N=1          |
| PL     | Positive or Zero        | N=0          |
| VS     | Overflow                | V=1          |
| VC     | No overflow             | V=0          |
| HI     | Unsigned higher         | C=1 & Z=0    |
| LS     | Unsigned lower or same  | C=0 or Z=1   |
| GE     | Greater or equal        | N=V          |
| LT     | Less than               | N!=V         |
| GT     | Greater than            | Z=0 & N=V    |
| LE     | Less than or equal      | Z=1 or N=!V  |
| AL     | Always                  |              |

# Examples of conditional execution

Use a sequence of several conditional instructions

Set the flags, then use various condition codes

```
if (a==0) x=0;
if (a>0) x=1;

CMP r0,#0

MOVEQ r1,#0

MOVGT r1,#1
```

Use conditional compare instructions

```
CMP r0,#4
CMPNE r0,#10
MOVEQ r1,#0
```

### Branch instructions

- Branch: B{<cond>} label
- Branch with Link: BL{<cond>} subroutine label



- The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to the PC
  - ± 32 Mbyte range
  - How to perform longer branches?

# Data processing Instructions

- Consist of :
  - Arithmetic: ADD ADC SUB SBC RSB RSC
  - Logical: AND ORR EOR BIC
  - Comparisons: CMP CMN TST TEQ
  - Data movement: MOV MVN
- These instructions only work on registers, NOT memory.
- Syntax:

```
<Operation>{<cond>}{S} Rd, Rn, Operand2
```

- Comparisons set flags only they do not specify Rd
- Data movement does not specify Rn
- Second operand is sent to the ALU via barrel shifter.

### The Barrel Shifter

#### **LSL**: Logical Left Shift



#### **LSR**: Logical Shift Right



Division by a power of 2

#### **ASR: Arithmetic Right Shift**



#### **ROR: Rotate Right**



Bit rotate with wrap around from LSB to MSB

#### **RRX: Rotate Right Extended**



Single bit rotate with wrap around from CF to MSB

# Using the Barrel Shifter: The Second Operand



### Immediate constants



The assembler converts immediate values to the rotate form:

```
    MOV r0,#4096 ; uses 0x40 ror 26
    ADD r1,r2,#0xFF0000 ; uses 0xFF ror 16
```

• The bitwise complements can also be formed using MVN:

```
MOV r0, #0xffffffff ; assembles to MVN r0,#0
```

· Values that cannot be generated in this way will cause an error.

# Loading 32 bit constants

- To allow larger constants to be loaded, the assembler offers a pseudo-instruction:
  - LDR rd, =const
- This will either:
  - Produce a MOV or MVN instruction to generate the value (if possible).

or

- Generate a LDR instruction with a PC-relative address to read the constant from a literal pool (Constant data area embedded in the code).
- For example
  - LDR r0,=0xFF => MOV r0,#0xFF • LDR r0,=0x55555555 => LDR r0,[PC,#Imm12] ...
    DCD 0x55555555
- This is the recommended way of loading constants into a register

# Multiply

### Syntax:

MUL{<cond>}{S} Rd, Rm, Rs

MLA(<cond>)(S) Rd,Rm,Rs,Rn

• [U|S]MULL{<cond>}{S} RdLo, RdHi, Rm, Rs

• [U|S]MLAL{<cond>}{S} RdLo, RdHi, Rm, Rs

Rd = Rm \* Rs

Rd = (Rm \* Rs) + Rn

RdHi,RdLo := Rm\*Rs

RdHi,RdLo := (Rm\*Rs)+RdHi,RdLo

### Cycle time

- Basic MUL instruction
  - 2-5 cycles on ARM7TDMI
  - 1-3 cycles on StrongARM/XScale
  - 2 cycles on ARM9E/ARM102xE
- +1 cycle for ARM9TDMI (over ARM7TDMI)
- +1 cycle for accumulate (not on 9E though result delay is one cycle longer)
- +1 cycle for "long"
- Above are "general rules" refer to the TRM for the core you are using for the exact details

# Single register data transfer

```
LDR STR Word

LDRB STRB Byte

LDRH STRH Halfword

LDRSB Signed byte load

LDRSH Signed halfword load
```

Memory system must support all access sizes

- Syntax:
  - LDR{<cond>}{<size>} Rd, <address>
  - STR{<cond>}{<size>} Rd, <address>

e.g. LDREQB

### Address accessed

- Address accessed by LDR/STR is specified by a base register plus an offset
- For word and unsigned byte accesses, offset can be
  - An unsigned 12-bit immediate value (ie 0 4095 bytes).

```
LDR r0,[r1,#8]
```

A register, optionally shifted by an immediate value

```
LDR r0,[r1,r2]
LDR r0,[r1,r2,LSL#2]
```

 This can be either added or subtracted from the base register:

```
LDR r0, [r1,#-8]

LDR r0, [r1,-r2]

LDR r0, [r1,-r2,LSL#2]
```

- For halfword and signed halfword / byte, offset can be:
  - An unsigned 8 bit immediate value (ie 0-255 bytes).
  - A register (unshifted).
- Choice of pre-indexed or post-indexed addressing

# Pre or Post Indexed Addressing?



## LDM / STM operation

• Syntax:

<LDM|STM>{<cond>}<addressing\_mode> Rb{!}, <register list>

• 4 addressing modes:

LDMIA / STMIA increment after
LDMIB / STMIB increment before
LDMDA / STMDA decrement after
LDMDB / STMDB decrement before



# Software Interrupt (SWI)



- Causes an exception trap to the SWI hardware vector
- The SWI handler can examine the SWI number to decide what operation has been requested.
- By using the SWI mechanism, an operating system can implement a set of privileged operations which applications running in user mode can request.
- Syntax:
  - SWI{<cond>} <SWI number>

### **PSR Transfer Instructions**



- MRS and MSR allow contents of CPSR / SPSR to be transferred to / from a general purpose register.
- Syntax:
  - MRS{<cond>} Rd,<psr> ; Rd = <psr>
  - MSR{<cond>} <psr[ fields]>,Rm ; <psr[ fields]> = Rm

### where

- <psr> = CPSR or SPSR
- [\_fields] = any combination of `fsxc'
- Also an immediate form
  - MSR{<cond>} <psr\_fields>,#Immediate
- In User Mode, all bits can be read but only the condition flags (\_f) can be written.

### **ARM Branches and Subroutines**

- B <label>
  - PC relative. ±32 Mbyte range.
- BL <subroutine>
  - Stores return address in LR
  - Returning implemented by restoring the PC from LR
  - For non-leaf functions, LR will have to be stacked



### Thumb

- Thumb is a 16-bit instruction set
  - Optimised for code density from C code (~65% of ARM code size)
  - Improved performance from narrow memory
  - Subset of the functionality of the ARM instruction set
- Core has additional execution state Thumb
  - Switch between ARM and Thumb using Bx instruction



### For most instructions generated by compiler:

- Conditional execution is not used
- Source and destination registers identical
- Only Low registers used
- Constants are of limited size
- Inline barrel shifter not used

# Example ARM-based System



### **AMBA**



- AMBA
  - Advanced Microcontroller Bus Architecture
- ADK
  - Complete AMBA Design Kit

- ACT
  - AMBA Compliance Testbench
- PrimeCell
  - ARM's AMBA compliant peripherals