## RUPRECHT-KARLS-UNIVERSITÄT HEIDELBERG



## Arthur Heimbrecht

Bachelor Thesis

HD-KIP-TODOTODOTODO

KIRCHHOFF-INSTITUT FÜR PHYSIK

# Department of Physics and Astronomy University of Heidelberg

## **Bachelor Thesis**

in Physics submitted by

## Arthur Heimbrecht

born in Speyer

**TODO 2123** 

## **Bachelor Thesis**

This Bachelor Thesis has been carried out by Arthur Heimbrecht at the

KIRCHHOFF INSTITUTE FOR PHYSICS
RUPRECHT-KARLS-UNIVERSITÄT HEIDELBERG
under the supervision of
Prof. Dr. Karlheinz Meier

#### **Bachelor Thesis**

As part of the Human Brain Project, BrainScaleS System is a unique project on many levels. This includes a processor solely used by the HICANN-DLS, which manages synaptic weights for every neuron built into one of the many wafers through a special I/O bus. To accelerate the speed at which this so called plasticity processor unit or "nux" computes all synaptic weights of every neuron used, the processor has an extended instruction set architecture that supports vector registers and single input multiple data.

This report deals with the task of adding built-in functions to an existing back-end of GCC, specifically the one already used by nux, in order to extend the already implemented set of functions according to the user's needs.

# Contents

| 1  | Introduction 1                                 |                                                                             |                                   |    |  |  |  |
|----|------------------------------------------------|-----------------------------------------------------------------------------|-----------------------------------|----|--|--|--|
| 2  |                                                | Fundamentals and Applications of Computer Architectures and Compiler Design |                                   |    |  |  |  |
|    | 2.1 Hardware Implementation of Neural Networks |                                                                             |                                   |    |  |  |  |
|    |                                                | 2.1.1                                                                       | Implementation in HICANN-DLSv2    | 4  |  |  |  |
|    |                                                | 2.1.2                                                                       | The Plasticity Processing Unit    | 5  |  |  |  |
|    | 2.2                                            | Basics                                                                      | s of Processor Architecture       | 8  |  |  |  |
|    |                                                | 2.2.1                                                                       | Co-processors and Extensions      | 11 |  |  |  |
|    |                                                |                                                                             | AltiVec Vector Extension          | 12 |  |  |  |
|    | 2.3                                            | Basic                                                                       | Compiler Structure                | 13 |  |  |  |
|    |                                                | 2.3.1                                                                       | Back-End and Code Generation      | 16 |  |  |  |
|    |                                                | 2.3.2                                                                       | Assembly Basics                   | 18 |  |  |  |
|    |                                                | 2.3.3                                                                       | Intrinsics                        | 21 |  |  |  |
|    |                                                | 2.3.4                                                                       | GNU Compiler Collection           | 21 |  |  |  |
|    |                                                |                                                                             | Register Transfer Language Basics | 23 |  |  |  |
| 3  | Fxte                                           | ending                                                                      | the GCC Back-End                  | 26 |  |  |  |
| •  | 3.1                                            |                                                                             | ers                               | 30 |  |  |  |
|    |                                                | gue and Epilogue                                                            | 34                                |    |  |  |  |
|    | 3.3                                            | _                                                                           | d                                 | 35 |  |  |  |
|    | 3.4                                            |                                                                             | ins and Machine Instructions      | 36 |  |  |  |
| 4  | Results and Applications 41                    |                                                                             |                                   |    |  |  |  |
| 5  | Discussion and Outlook 43                      |                                                                             |                                   |    |  |  |  |
| Αŗ | penc                                           | lix                                                                         |                                   | 46 |  |  |  |
| Bi | bliog                                          | raphy                                                                       |                                   | 52 |  |  |  |

## 1 Introduction

As neuromorphic computing becomes more popular, various applications for such systems emerge and make use of analogue hardware's advantages over traditional computers . But to do so, one must be familiar with the system that is used as it often differs in many ways from common computer architectures. Thus programing for such systems can feel odd at times as users need to abandon some familiar techniques and acquire new skills instead. This is a hurdle for many users when developing new experiments and initially takes a significant amount of time.

The more a system abandons common elements of programming which users are accustomed to the more problems can emerge from this. Not only will fewer users take the initiative of writing for such systems but also can code easily get confusing, hard to debug and at worst even ineffective.

An example of this is the current way of programming for the plasticity processing unit of the HICANN-DLS. It is responsible for applying so called plasticity rules to the neuromorphic system and resembles a commonly known processor which was modified for this cause. Despite basic programming still being the same for this processor it differs for creating mentioned plasticity rules. Although these are still programmed in a C environment, a user is given a set of functions and predefined macros that are based on low-level programming.

This was done to include users who want to write programs for the PPU despite being unfamiliar with low-level languages and leads to PPU programs having a distinct look and feel that is only in some regards similar to C. Instead it is like pushing the user back to the origins of computing; for example reading out the value of a variable in memory needs unhandy workarounds and a repetitive set of code instructions.

The main reason we may feel reluctant to such outdated programming are the advantages of compilers, which became a standard for programmers over decades.

From the early stages of computing, compilers have developed towards a standard tool in everyday programming but at the same time became more of a black box that transforms a program into an executable file. For this reason it may be difficult for some users to abandon such convenience and go back to low-level programming.

Luckily the PPU is not completely without compiler support as basic computing is supported but distinct features of the PPU are only usable on a low level. This can easily cause inconvenience for users as these features are necessary to implement learning rules for synapses which are themselves elementary to neuromorphic programming. As of now users need to repeatedly mix high-level with low-level programming when developing for the PPU which may lead to different problems as users have to adapt to

paper on future of neu-

this atypical style of programming. At worst this can even lead to ineffective programs — as performance is important for neuromorphic programming and the main advantage of the PPU — and an unreasonable amount of time and work to achieve simple results.

This makes it obvious that compiler support is desirable as the PPU is yet not fully supported by any compiler. Especially as compilers offer a convenient way of supporting various high-level programming languages.

The reason the PPU does not have compiler support is the custom nature of the processor architecture, which was developed solely for the Brain-ScaleS system and the HICANN-DLS. It can be classified as an application specific instruction set architecture which is an intermediate form between general purpose processors and application specific integrated circuits and offers a partly customized instruction set that is optimized for certain applications.

With and without the help of the PPU the HICANN-DLS already is a platform for many interesting experiments that were conducted or are yet to be conducted in the near future. Applications like in-the-loop training or RSTDP have been developed by various users despite the problem of low-level coding by either not using the PPU at all



Figure 1.1: set-up of a HICANN-DLS test system

or acknowledging the low-level nature of the PPU and learning how to write programs that involve the PPU. Even when taking the effort of learning to code for the PPU, users are constantly challenged by missing capabilities of program code such as creating parameterized functions which leads to repetitive code or difficulties when involving calibration into code.

As we do not want the PPU to be abandoned because of this, even though there is much potential to it, as a hand full of users continuously shows, users should be encouraged to develop applications for the PPU. One way to do so is offering more tools that increase the capabilities of programs while at the same time reducing the effort it takes to develop for the PPU. Besides achieving full high-level programming when working with the PPU, compiler support could also include code optimization and debugging features that help creating more applications like the above. This work therefore aims to achieve compiler support of the PPU's hardware with as many features as possible and simplify programming for existing users what could also advertise the platform to new users and possibly accelerate performance of programs on the PPU. At some point compiler support could also facilitate automatic code generation as a prerequisite for implementation of very high-level languages. That would mean that users do not need to code for the PPU itself but instead create plasticity rules in existing program environments from where code is translated into PPU programs. This creates also the need for optimization of PPU code which could be done by developers specifically for

the PPU but it is far easier and likely more efficient to utilize existing optimization techniques that are built into virtually every compiler.

This thesis will focus on achieving aforementioned compiler support and briefly explain the process itself. As fundamental knowledge of both processors and compilers is needed along the way the second chapter will start with a very basic introduction to both topics and go into detail for the processor and compiler that were used in this thesis. This may not render additional literature obsolete but should explain the basic concepts to an extend which is sufficient. Afterwards the process of extending the compiler is explained into some detail and the result as well as first test cases are presented. The thesis will conclude in a resume and give an outlook to future applications and development of the compiler and the PPU.

Ultimately we want simplify to PPU programming overall and give users tools at hand that allow for many interesting experiments and encourage further development of the whole system.

# 2 Fundamentals and Applications of Computer Architectures and Compiler Design

## 2.1 Hardware Implementation of Neural Networks

As this thesis mainly focuses on a processor that is an essential part of the **HICANN-DLS** (High Input Count Analogue Neural Network-Digital Learning System) we will focus first on the HICANN-DLS as a whole and then look into the **PPU** (Plasticity Processing Unit) in detail while also addressing processor architecture in general.

The HICANN-DLS was built to emulate neural networks at high speeds with low power consumption.

On a very abstract level neurons in the brain resemble nodes in a network. Neurons are interconnected through dendrites, synapses and axons which can be of different coupling strength. In contrast to many other systems we use neuron model that is **spike based**. This means that a neuron is activated only for a short continuous time, called a spike, and sends out this spike through its axon to neurons that are connected via synapses. Between those spikes the neuron is resting and not sending any signals. Synapses can work quite differently but have in common that there is a certain weight associated to them, which we will call synaptic weight. This is equal to a gain with which the pre-synaptic signal is either amplified or attenuated. The signal is then passed through the dendrite of the post-synaptic neuron to the soma where all incoming signals are integrated. If the integrated signals reach a certain threshold the neuron spikes and sends this signal to other neurons (?).

bild synaptic array

#### 2.1.1 Implementation in HICANN-DLSv2

The HICANN-DLS system implements a simplified neural network model in analogue electronics in order to emulate neuronal networks in a biologically plausible parameter range.

At its core it has a so called **synaptic array** that connects 32 neurons which are located on a single chip to 32 different pre-synaptic inputs. Pre-synaptic inputs are arranged along one side of the chip which is orthogonal to the arrangement of the neurons. They enclose a 2D field which will be our synaptic array as it mainly consists of synapse circuits. All neurons reach into the array through conductors that are organized in columns. The pre-synaptic inputs respectively have conductors that resemble rows in the array. At

other systems

each intersection of those conducting lines, a synapse is placed that thereby connects a neuron and a pre-synaptic input. Overall this gives 1024 synapses that interconnect every neuron and every synapse.

A field programmable gate array (FPGA) is connected to all pre-synaptic inputs and feeds external spikes into the ship which can either be computed by the FPGA or routed spikes from another ship. Along an input row the signal of one pre-synaptic input reaches all synapses that connected and is processed individually. Each synapse holds a 6-bit wide **synaptic weight** and an internal decoder address, which both can be changed from outside.

The FPGA sends a 6-bit address whenever it sends a spike to a pre-synaptic input which then is compared by each synapse to the addresses they hold themselves. In case the addresses match, each synapse multiplies an output signal with the weight it stores and sends the result along a column where is reaches a neuron. All signals that are sent, are collected along a column and reach the neuron. Inside the neurons the resulting current input is evaluated in regard to a threshold and other parameters which decide on whether the neuron is spiking or not. If the neuron is spiking it sends an output signal to the FPGA which is responsible for spike routing in the first place. All of this is done continuously and may not follow discrete time steps. Along each column sits a **correlation ADC** (CADC) that converts the signal one neuron receives into digital data for analysis but can also be accessed by the PPU similar to the synaptic array.

The HICANN-DLS is also equipped with a processing unit that includes a vector extension and some memory for it to operate on. This is the plasticity processing unit (PPU) which is also connected to the synapse array for it to read and write synaptic weights.

The will use the following naming convention throughout this thesis:

**Plasticity Processing Unit PPU** is the processor which is part of HICANN-DLS and mainly responsible for plasticity of synapses.

**nux** refers to the complete architecture of the PPU together with its vector extension.

**s2pp** is short for synaptic plasticity processor but we will use this as name of the vector extension.

Synapses in the synapse array are realized as small repetitive circuits that contain 8 bits of accessible information each although only 6 bits are used as weights and the spare two upper bits are used for calibration. The synapse array can also be used in 16-bit mode for higher accuracy that combines the weights of two synapses to a 12-bit weight.

figure synapse

Digital configuration of the synapses and writing PPU programs to the memory is handled by an FPGA that has access to every interface of a HICANN-DLS chip.

#### 2.1.2 The Plasticity Processing Unit

The PPU, which was designed by (*Friedmann et al.*, 2016), is a custom processor, that is based on the Power Instruction Set Architecture (PowerISA), which was developed by

IBM since 1990. Specifically the PPU uses POWER7 which is a successor of the original POWER architecture and was released in 2010 and runs at 100 MHz clock frequency. It is a 32-bit architecture therefore registers are 32 bit wide.

It was developed to handle plasticity and as such applies different plasticity rules to synapses during or in between experiments. This is done much faster by the PPU than by the FPGA which is important for achieving experimental speeds that are  $10^3$  times faster than their biological counterparts. In general the PPU is meant to handle plasticity of the synapses during experiments while the FPGA should be used to initially set up an experiment, manage spike input and record data

The PPU is accompanied by 16 kiB of memory as well as 4 kiB of instruction cache which together is called the plasticity sub-system. The PPU's distinct feature is its special-function unit, the s2pp vector extension (VE), that allows for Single Input Multiple Data (SIMD) operations. The VE is only weakly coupled to the general purpose processor (GPP) of the PPU. Both parts can operate in parallel while interaction is highly limited.

All instructions for the VE must first pass the GPP though, which detects vector instructions and passes them to the VE as it is usual for most processor extensions. These instructions then go into a queue that holds all vector instructions where they are fetched from in order. Going from there, the instructions shortly stay in a reservation station that is specific for each kind of operation and thus allows for little out of order operation for instructions in these reservations stations.

#### figure vector extension

This allows for performing some arithmetic operations on a vector during the process of accessing a different vector in memory. The result is faster processing speed as pipelining for each instruction is also supported. The limiting factor for this though remains the vectors register file's single port for reading and writing.

Although the main limiting factor in processing speed is the memory access, both, the GPP and the VE, share the same MMU and thus any access of the GPP to vectors in memory must be handled with care as, the GPP and VE are not synchronized. The MMU is very simple as it does not cache memory instructions and also has matching virtual and physical addresses.

For this reason one must be aware of the sync instruction that is a memory barrier and stops the GPP from executing instruction until all memory requests of GPP and VE are handled.

#### code example

This can result in up to a few hundred cycles of waiting for memory access to be finished and therefore this should only be done if necessary. sync is a standard instruction of the PowerISA and further information can be found here.

#### reference

The PPU is also able to read out spike counts and additional information through a bus which is accessible through the memory interface. It uses the upper 16 bits of a memory address which are available because the memory is only 16 kiB large which is equivalent to 16-bit addresses. The lower 16 bits are used since the system uses big-endian

maybe different

numbers. A pointer to such a virtual memory address allows to read for example spike counts during an experiment. This is done the same way for the whole chip configuration such as analog neuron parameters.

Besides the VE and the GPP, the memory bus also accessible by the FPGA. This is needed for writing programs into the memory as well as getting results during or after experiments also allows for communication between a "master" FPGA and a "slave" PPU.

Hence the vector unit was equipped with an extra bus that connects to the mentioned synapse array. The synapse array though is alternatively accessible through the main memory bus by setting the first bits similar to the spiking rate information. Using this extra bus or the instructions associated with it is more comfortable and gives more structure to the program but does not increase performance of memory access. As mentioned before, the GPP and VE share a memory bus but vector memory instructions need to pass the VE first which leads to delay that makes inserting a heavy weight memory barrier or "syncing" necessary at times.

Specifically the vector extension allows for either use of 8 element vectors with elements being halfword (1 halfword = 2 bytes) sized or 16 element vectors with each element byte sized Thus every vector is 16 bytes or 128 bits long.

#### figure of vector size

This is also the size of each vector register that is available, which are 32 in total, in contrast to 32 general purpose registers with 32 bit each. The VE also features a vector accumulator of 128 bits width and a vector condition register which holds 3 bits for each half byte of the vector, making 96 bit in total, that determine which condition applies.

To handle the vector unit the instruction set was extended by 53 new vector instructions that partly share their opcodes with existing AltiVec instructions. This renders no problem since the nux does not recognize AltiVec opcodes and most like is not going to in the future. An overview of all opcodes is provided by *Friedmann* (2016), which is recommended as accompanying literature besides this thesis. In general these opcodes are divided into 6 groups of instructions:

modulo halfword/byte instructions apply a modulo operation after every instruction which causes wrap around in case of an overflow at the most significant bit position. Each instruction is provided as halfword (modulo  $2^{16}$ ) and as byte instruction (modulo  $2^{8}$ ).

saturation fractional halfword/byte instructions allow for the results only to be in the range between a maximum that is equivalent to one when we see the MSB as  $2^{-1}$  and the minimum is  $2^{-15}$  for halfword and  $2^{-7}$  for byte instructions.

check with ahartel on

**permute instructions** perform operations on vectors that handle elements of vectors only as a series of bits.

**load/store instructions** move vectors between vector registers and memory or the synapse array.

When using these instruction one must always keep in mind that the weights of the synapses only consist of the latter 6 or 12 bits which are in a vector register and are right aligned. As a user still wants the full functionality and also as much accuracy as possible, a vector is typically shifted left when reading from the synapse array to align the MSB to the very left and thus right shifted when stored in the synapse array.

applications of the PPU today like in-the-loop experiments and controlling

#### 2.2 Basics of Processor Architecture

add reference

cite friedmann disserta-

Almost all contemporary processors are built using the so called von-Neumann architecture. Although the main goal of this project is to provide an alternative analogue architecture that is inspired by nature, there are advantages to classic computing which are needed for some applications.

The main advantage of digital systems over analogue, systems such as the human brain, is the ability to do numeric and logical operations at much higher speeds and precision as well as the availability of existing digital interfaces. For this reason "normal" processors are responsible for handling experiment data as well as configuration of an experiment. We will now shortly touch the basics of such processors and explain common terms.

#### figure classic architecture

In general a microprocessor can be seen as a combination of two units which are an operational section and a control section. The control section is responsible for fetching instructions and operands, interpreting them, controlling their execution as well as reading and writing to the main memory or other buses. The operational section on the other hand saves operands and results only as long as they are needed and performs logic or arithmetic operation on these as instructed by the control section. Prominent parts of the operational section are the **arithmetic logic unit** (ALU) and the **register file** (RF).

The register file can be seen as short-term memory of the processor. It consists of several repeated elements, called **registers**, that save data and have a same size which is determined by the architecture; a 32-bit architecture has 32-bit wide registers.

Typically the number and purpose of registers varies for different architectures. Common purposes of registers are:

general-purpose register GPR These registers can be used for virtually anything and in most cases carry values that are soon to be used by the ALU. Most registers on a processor are typically GPRs. Any register that is not a GPR is called a special purpose register SPR

**link register LR** This register marks the jump point of function calls. This means that after a function completes, the program jumps to the address in the link register.

**condition register CR** This register's value is set by an instruction that compares one or two values in GPRs. Its value can determine a condition for some instructions if they are to be executed or not.

The ALU uses values which are stored in the register file to perform the aforementioned logic or arithmetic operations and saves the results there as well.

Some architectures also have an accumulator that is often part of the ALU. Intermediate results can be stored there because access to the accumulator is the fastest possible but holds only a single value.

It is important to note:

```
speed(accumulator) > speed(register) \gg speed(memory)
```

As speed is always important in computing we therefore want to rather use registers than memory and only write to memory when not enough registers are available or a task has finished.

Registers such as the accumulator can also either be accessible to a user or only accessible to subsections such as the ALU. This is different for every processor architecture and depends on a multitude of factors:

- available chip space
- available time and money for the design
- maximum clock frequency
- energy consumption
- instruction set complexity

These factors influence one another as a complex instruction set allows for complicated arithmetic operations to be performed with a single instruction but this often results in the effective clock frequency being lower due to many micro instructions. During a single clock cycle a chip usually performs one so called **micro instruction** which is part of a **machine instruction**. An example for an add instruction (d = add(a,b)) would be:

complexity > lower clock freq

Listing 2.1: example of micro instructions

- 1. fetch the instruction from memory
- 2. decode instruction
- 3. fetch first operand a
- 4. fetch second operand b
- 5. perform operation on operands
- 6. store result

For more complex machine instructions the amount of micro instructions can be much higher, but such simple machine instructions basically set the minimum amount of micro instructions for any machine instruction. It is obvious that the faster the clock frequency the faster micro instructions can be executed the faster the processor. But at the same time the more complex the instruction set is the fewer machine instructions are needed overall the faster the processor. As mentioned above this results in a trade-off between clock frequency and instruction set complexity.

inhaltlicher zusammen-

book

The instruction set includes all available instructions for an ALU thus the ALU gets more complicated to design and needs more space as the instruction set gets more complex. Because of this, one usually differs between two kinds of processors:

#### **CISC** Complex Instruction Set Computer

#### **RISC** Reduced Instruction Set Computer

The latter usually is reduced to such simple instructions as add or sub and connects them to create more complex instructions. Since the PPU is a RISC architecture we focus on RISC's key values. Simple instructions are similar to micro instructions which were mentioned earlier, but every simple instruction has the same number of micro instructions. RISC architectures therefore start "pipelining" instructions, which means starting the next machine instruction as the previous machine instruction just performed the first micro instruction in a clock cycle. Ideally, this will increase the performance by a factor that is equal to the number of micro instructions in a machine instruction as that many machine instructions can be initiated in the same time it takes to complete a single machine instruction. It must be noted though that the processor has to detect hazards which are data dependencies between instructions where one instruction needs the result of another. Such instructions usually are postponed to a delay-slot and other instructions that do not cause hazards are executed instead. This results in reordering of instructions on a processor level. Also it takes several cycles for memory instructions to load or store data this effectively stalls the processor until the memory instruction has finished. Therefore RISCs try to avoid memory access as much as possible and use registers instead. Luckily a normal RISC architecture provides more registers as the ALU needs less space due to reduced complexity and also can be operated at higher clock frequencies, therefore it is perfect for simple processors that only need to do simple arithmetic as fast as possible.

Next we take a closer look at memory. Normally the memory of a von-Neumann machine contains both, the program and data (this is contrary Harvard architectures). The program here describes a list of instructions that are part of the instruction set. Each instruction itself is represented as a sequence of bits in memory that resemble the following.

#### graphic of opcode

The first part which is called an opcode is simply a number that stands for an operation performed by the ALU. The ALU reads this number and performs the necessary steps. Typically this part is about 8 bits long and has an alias string such as add that is called a mnemonic. The second part is the result which is of the same type as the third and forth part. These are the argument addresses or operands of an operation and can either be a memory address or a register number as both are valid operands. Many RISC architectures have an instruction set that consists exclusively of 3 operand instructions. Any instructions that seem to have less than three operands are normally mapped on instructions that have three operands. It is quite common to use more complex instructions for relatively simple instruction as this reduces the number of opcodes. An example would be moving the contents of register 1 to register 2. This usually maps to an or comparison between register 1 and a register that is all zeros where the result (the same as register 1) is saved in register 2.

It is important to note that in most RISC architectures if we want a memory address as operand, this is done indirectly. A memory address can not be an operand on its own but is loaded into a different register and a different register gets to hold the data

from the memory. This is called a load instruction and its counter part would be a store instruction. Architectures that work like that are called load/store architectures.

This means also that the amount of accessible memory is typically limited by the width of a single register Memory is often seen as blocks and with addresses. Because the smallest amount of information which we are interested in is a byte, each address is equivalent to one byte in memory. Therefore the maximum amount of memory that can be used is:

$$2^{n}byte \xrightarrow{n=32 \text{ bit}} 2^{32}byte \approx 4 * 10^{9}byte = 4GB$$
 (2.1)

Normally though it is not the processor itself that keeps track of the memory. This is usually done by a memory management unit (MMU). It handles all memory access of the processor as it can provide a set of virtual memory addresses which itself then transforms into physical addresses. Most modern MMUs also incorporate a cache that stores memory operations while others are handled and detects dependencies within this cache which it can resolve. This results in faster transfer of data as two or more instructions access the same memory which then is handled in the cache. Not all MMUs support this though and this might lead to certain problems when handling memory. If instructions are reordered due to pipelining and dependencies on the same memory address are not detected, an instruction may write to the memory before a different one could load the previous value it needed. For this reason exist memory barriers. A memory barrier is an instruction that is equal to a guard in code that waits until all load and store instructions issued to the MMU are finished and then allows the code to proceed. It therefore splits the code into instructions issued before the memory barrier and issued after the memory barrier. Even with reordering this prohibits any instruction to be executed on the wrong side of the barrier and thereby ensures conflicting memory instructions to not interfere with one another.

Memory can be split into two popular types which are static random access memory (SRAM) and dynamic random access memory (DRAM). The differ in how bits are set on each RAM. SRAM uses Flip Flops to switch transistors that indicate which bit is set, while DRAM uses capacities that are charged to do so.

We already introduced many parts of a processor which need to be connected somehow. Connections between these parts are called buses and also have a width measured in bytes. Bus speeds are very high as they transport data in parallel such as the contents of a register. Thus most buses should be as wide as a register of the processor. But buses of such width need much space. Therefore some architectures use narrower buses with fewer bits than a register and use two instructions to transfer the contents of a full register. Systems of this sort are described as 32/16-bit architecture, which means that registers are 32 bit wide while buses are only 16 bit wide. As the higher order bits of registers are not as often used as the lower ones this results in less performance loss than initially expected.

#### 2.2.1 Co-processors and Extensions

RISC architectures sometimes need so called co-processors for instructions that are not included in the instruction set but are often enough needed. An example would be

multiplication which would need many cycles when split in add instructions but as part a co-processor can be performed in just a few cycles. In such a case the control section recognizes the mult instruction and passes it to the co-processor and later on fetches the result.

This can be extended to whole units such as the ALU existing in parallel. One example would be a floating point unit (FPU) which is nowadays standard for most processors and handles all instructions on floating point numbers. For this the FPU has its own floating point registers (FPRs) in a separate register file on which it performs instructions and which also have parallel access to the memory.

Another kind of extensions are vector extensions that do the same as the FPU but for vectors instead of floats. This is mostly wanted for highly parallel processes such as graphic rendering or audio and video processing. But also early supercomputers such as the Cray-1 made use of vector processing to gain performance by operating on multiple values simultaneously through a single register. This could either be realized through a parallel architecture or more easily through pipelining the instruction on one vector over its elements. The latter one makes sense since there are typically no dependencies between single elements in the same vector. Nowadays many of the common architectures support vector processing. A few examples of these are:

- x86 with SSE-series and AVX
- IA-32 with MMX
- AMD K6-2 with 3DNow!
- PowerPC with AltiVec and SPE

As mentioned these were mostly intended for speeding up tasks like adjusting the contrast of an image. There is also the possibility to vectorize loops in programming if there are no dependencies between loop cycles.

#### AltiVec Vector Extension

In our case we take a special interest in the AltiVec vector extension which developed by Apple, IBM and Motorola in the mid 1990's and is also known as Vector Media Extension (VMX) and Velocity Engine for the POWER architecture. The AltiVec extension provides the processor with a single-precision floating point and integer SIMD instruction set. The vector register file includes 32 vector registers are each 128-bit wide. AV: These vector registers can either hold sixteen 8-bit chars, eight 16-bit shorts or four 32-bit ints or single precision floats, each signed and unsigned. Single elements of these vectors can only be accessed through memory because there is no instruction that combines scalar register with vector registers. Except for one type of instruction that "splats" the value of a scalar register into all elements of the vector register. The reason we take such an interest in this vector extension is that it resembles most characteristics of the PPU's vector extension and is already implemented in the PowerPC back-end of GCC. There a few differences though:

reference

reference

First the PPU's VE uses a conditional register (CR) to perform instructions only on those elements of a vector register, that meet the condition in the corresponding part of the CR, which is specified by the user, while the AltiVec VE utilizes the CR which included in the PowerPC architecture. This results in not allowing selective operations on individual elements through the CR but allows for checking if all elements meet the condition in a single instruction. If element-wise selection is needed AltiVec offers this through vector masks.

The AltiVec VE has two register on its own though, which are the VCSR and VRSAVE registers. The Vector Status and Control Register (VSCR) is responsible for detecting saturation in vector operations and decides which floating point mode is used. The Vector Save/Restore Register (VRSAVE) assists applications and operation systems by indicating for each VR if it is currently used by a process and thus must be restored in case of an interrupt.

Both of these register are not available in the PPU's VE but would likely not be needed for simple arithmetic tasks as the PPU is meant to perform.

### 2.3 Basic Compiler Structure

At its core every compiler translates a source-language into a target-language, most often it translates a high-level, human readable programming language into a machine language that consists of basic instructions that build complicated structures. In doing so compilers may be the essential part in everyday lives of programmers everywhere. But compilers do not exist as long as computers do and their development played a big role in making computers such an important part of everyday life as they are today. What differs compilers from the competing concept of interpreters is the separation of compile-time and run-time. As interpreters combine these two and translate a program as it is run, a compiler takes the time to read the source-language file completely (often several times) and only then creates the executable files which are run after the process has finished. The advantages of this are simple: While a compiler takes some time at first until the program can be run, the resulting executable is next to always faster and more efficient.

This is due to the possibility of optimizing code during the compilation process and the chance of reading through the source file several times if this is needed (with each time the code is read being called a "pass"). Of course there do exist several compilers today and what matters to the user is typically the combination of the amount of time it takes to compile a program and the performance of that program. Though a compiler is not solely involved into the processing of a programming language towards an executable program. Figure 2.1 illustrates the chain of tools that is involved into this process: As one can see the **preprocessor** modifies the source before it is processed by the compiler and removes comments, substitutes macros and also includes other files into the source. After the compiler is finished with its job the **assembler** takes over and translates the output of the compiler which is written in a language called assembly into actual machine code by substituting the easy-to-read string alternatives with actual opcodes. At last the



Figure 2.1: Schematic overview of different compiler stages.

linker combines the resulting "object-files" that the assembler emitted for different source files with standard library functions that are also already compiled and other resources. The result is a single file that is directly executable. The only task which is left for the loader is assigning a base address to the relative memory references of the "relocatable" file which were used until now. The code is now fully written in machine language and ready for operation.

But since we are more interested in compilers than other components, we will take a better look at the compiler itself. Figure 2.1 shows the common separation of a compiler into front-end, back-end and the optional middle-end. This is done to make a compiler portable, which means allowing the compiler to work for different source-languages which are implemented in the front-end and target-languages which must be specified in the back-end. Therefore if one wants to compile two different programs e.g. one in C the other in FORTRAN, it is necessary to change the front-end but not the back-end because the machine or "target" stays the same. The middle-end in this regard is not always needed but could be responsible for optimizations that are both source-independent and target-independent. Of course the different parts of the compiler have to communicate through a language that all parts can understand or speak. Such a language is called intermediate representation (IR) and also used during different phases of the compilation process. It may differ in its form but always stays a representation of the original program code.

The different phases of a compilation process are illustrated to the far right of figure 2.1. There is no middle-end included into this scheme as it is not a mandatory part of the compiler and would only be responsible for optimizations. But we will take a short look at the other phases: First the source code is fed into the **scanner** that performs lexical analysis, which is combining sequences of characters to symbols of something called tokens that get associated with an attribute such as "number" or "plus-sign" and the symbol. Next the **parser** takes the sequence of tokens and builds a syntax tree that represents the structure of a program and is extended by the **semantic analyzer** which adds known attributes at compile-time like "integer" or "array of integers" and checks if the resulting combinations of attributes are valid. This already is the first form of IR. The **source code optimizer** which is the last phase of the front-end takes the syntax tree and takes the first shoot at optimizing the code. Typically only light optimization is possible at this point such as pre-computing simple arithmetic instructions and different kinds of optimization exist. After the source code optimizer is done the syntax tree is converted to intermediate representation in order to be passed to the back-end.

The **code generator** takes this IR and translates it to machine code that fits the target - typically this is assembly. At last the **target code optimizer** tries to apply target-specific optimization until the target code can be emitted.

During these phases the compiler also generates a symbol and literal table. A symbol table is as the name states an overview of all symbols that are used in the program, it contains the symbols name and the attribute of the semantic analyzer. A literal table in contrast holds constants and strings and makes them available globally by reference, as does the symbol table. This information is used by the code generator and various optimization processes.

#### 2.3.1 Back-End and Code Generation

We now want to focus a little more on the last two phases of a compiler, which are also part of the back-end. We already stated that the back-end is responsible for code generation and target optimization and since we will keep focus on the back-end later on, we need to get used to a few other terms that are common when talking about compiler back-ends.

Usually the processes of code generation and target optimization are entangled as optimization can take place at different phases of code generation. Thus we first take a look at code generation in the back-end.

As we learned already, the source program reaches the back-end in form of IR. Often the IR is already linearized and thereby again in a form that can be seen as sequence of instructions. Because of this the IR may also be referred to as Intermediate Code. The process of generating actual machine code from this is again split into different phases:

- instruction selection
- instruction scheduling
- register allocation

At first the back-end recognizes sets of instructions in intermediate code that can be expressed as an equivalent machine instruction. Depending on the complexity of the instruction set a single machine instruction can combine several IR instructions. This may involve additional information that the front-end aggregated and added to the IR as attributes single machine instruction can combine several IR instructions. At the end of this is a compiler typically emits a sequence of assembly instructions which we will explain later on. In order to fulfill this task the compiler needs the specifications of the target it compiles for. This is called a target description and can contain things like specifications of the register-set, restrictions and alignment in memory and availability of extensions and functionalities. The compiler also needs knowledge of the instruction set of a target sometimes referred to as the ISA which is in essence a list of instructions which are available and also their functionality. The compiler picks instructions according to their functionality from this list and substitutes the IR with this. Ideally a back-end thus could support different back-ends just by exchanging the machine description and the ISA as the basic methods of generating code are the same for most targets.

After the IR is converted into machine instructions the back-end now rearranges the sequence of instruction. This needs to be done as different instructions take different amounts of time to be executed. If a subsequent instruction depends on the result of a previous instruction the compiler now has two alternative approaches to solve this. First it can simply stall the programs execution as long as the instruction is executed and feed the next instruction into the processor only when the dependency is solved. This means that the compiler adds nops before every instruction that needs to wait for an operand as nop tells the processor to wait until the previous instruction has finished. For critical memory usage the compiler can also insert syncs as memory barriers before hazardous memory instructions. Alternatively it can stall only the instruction which depends on

the result which is currently computed but perform instructions that do not depend on the result in the mean time. By doing so the scheduler increases performance noticeably and thus can partly be seen as part of the optimization process. On RISC architectures this is especially important as load and store instructions can take a few hundred times more clock cycles than normal register instructions and pipelining depends mainly on the instruction sequence. Thus the scheduler is also involved parallelization of code. As a result of this a compiler would usually accumulate all load instructions at he beginning of a procedure and start computing on registers that already have a value while the others are still loaded. This is done vice versa at the end of a procedure for storing the results in memory. This process of course needs the compiler to know the amount of time it takes for an instruction to be executed and works hand in hand with hazard detection on processor level.

At last he compiler handles register allocation which also includes memory handling. Typically the previous processes expect an ideal target machine which provides and endless amount of registers. As in reality the processor only has k registers the register allocator reduces the number of "virtual registers" or "pseudoregisters" that are requested to the available number of "hard registers" k. For this to be possible the compiler decides whether a value can live throughout a procedure in a register or must be placed into memory because there are not enough registers available. This results in the allocator adding load and store instructions to the machine code in order to temporarily save those registers in memory which is called "spilling". It is obvious that this can hurt performance and therefore the compiler tries to keep spilling of registers to a minimum and also insert spill code at places where it delays other instructions as little as possible. At he end of register allocation the compiler assigns hard registers to the virtual registers which are now only k at a time.

During and after code generation the compiler also applies optimizations to the machine code. Any optimization to the code though must take three things into consideration, which are safety, profitability and risk/effort. The first thing which always must be fulfilled, is safety. Only if the compiler can guarantee that an optimization does not change the result of the transformed code compared to the original code it may use this optimization! Only if this applies the compiler may check for the profit of an optimization which most often is a gain in performance but could also be the size of the program. At last the effort or time it takes for the compiler to perform this optimization and the risk of generating actually bad or ineffective code should be taken into account as well. If optimization passes these three aspects it may be applied to the code. In the end there exist some simple optimizations that always pass this test like the deletion of unnecessary actions or unreachable code, e.g. functions that are never called. Another example would be the reordering of code like the scheduler did before or the elimination of redundant code, which applies if the same value is computed at different points and thus the first result simply can be saved in a register. If a compiler knows the specific costs of instructions, it can also try to substitute general instruction with more specialized but faster instructions, like substituting a multiplication with 2 by shifting a value one position to the left. There exist many more ways of optimization but we only want to explain one more kind of optimization which is called peephole optimization.

In peephole optimization the compiler only looks at a small amount of code through a "peephole" and tries to find a substitution for this specific sequence of instructions. These substitutions must be specified by hand and are highly target-dependent in contrast to the optimizations which were mentioned before that are target-independent. If the sequence can be substituted the peephole optimizer does so, otherwise the peephole is moved one step further and the new sequence is evaluated.

#### 2.3.2 Assembly Basics

Assembly (asm) was mentioned a few times in this thesis already and we need to know at least a few basic assembly instructions that will help us later on to understand resulting machine code. Assembly is usually the lowest level of representation of a program that still is human-readable. Assembly code is basically equivalent to machine code and therefore by many seen as such though assembly code needs to be assembled by the assembler first. Assembly instructions all follow a certain scheme which is:

```
add r1, 0x3000, 5
mnemonic operand/result operand operand
```

For RISC architectures instructions typically consist of 3 operands because operations are usually between registers only (except for load/store a.k.a. memory instructions). The mnemonic in most cases is a named after the first letters of the instructions full name, which is emphasized in the following table. The operand can be of three different types which are all shown above. They either represent a specific register (r1 = register 1), a memory address (0x3000 = the value at memory location <math>0x3000) or an immediate value (5 = the integer 5). Register operands can also have an indirect use, which means that that content of the register is taken into account. I.e. a memory address can be saved to the register and an operation uses the value at the memory location which the register refers to.

| mnemonic | operands           | description                                                      |
|----------|--------------------|------------------------------------------------------------------|
| add      | RT, RA, RB         | add RB to RA and store the result in RT                          |
| addi     | RT, RA, SI         | add SI to RA and store the result in RT                          |
| addis    | RT, RA, SI         | add SI shifted left by 16 bit to RA and store the result         |
|          |                    | in RT                                                            |
| and      | RA, RS, RB         | RS and RB are anded and the result is stored in RT               |
| b        | target_addr        | branch to the code at target_addr                                |
| ble      | BF, target_addr    | branch to the code at target_addr if BF is less or               |
|          |                    | equal                                                            |
| blr      |                    | branch to the code at address in the linker register             |
| cmp      | BF, L, RA, RB      | RA and RB are compared and the result (gt,lt,eq) is              |
|          |                    | stored in Bf, L depicts if 32-bit or 64-bit are compared         |
| cmplwi   | BF, RA, SI         | RA compared logically wordwise with immediate SI                 |
|          |                    | and the result is stored in BF                                   |
| and      | RA, RS, RB         | RS and RB are anded and the result is stored in RT               |
| eieio    |                    | RS and RB are anded and the result is stored in RT               |
|          |                    | EDITTHIS                                                         |
| isync    |                    | RS and RB are anded and the result is stored in RT               |
|          |                    | EDITTHIS                                                         |
| la       | RT, D(RA)          | load aggregate D + RA into RT                                    |
| li       | RT, SI             | load immediate value SI into RT                                  |
| lis      | RT, SI             | load immediate value SI shifted left by 16 bit into RT           |
| lbz      | RT, D(RA)          | load byte at address D+RA into RT, fill the other bits           |
|          |                    | with $\mathbf{z}$ eros                                           |
| lwz      | RT, D(RA)          | load word at address D+RA into RT, fill the other bits           |
|          |                    | with $\mathbf{z}$ eros                                           |
| mflr     | RT                 | move from linker register to RT                                  |
| mr       | RT, RA             | move register RA to RT                                           |
| nop      |                    | halts execution until the previous instructions are fin-         |
|          |                    | ished EDITHIS                                                    |
| rlwinm   | RA, RS, SH, MB, ME | rotate left word in RS by immediate SH bits then and             |
|          |                    | with $\mathbf{m}$ ask which is 1 from MB+32 to ME+32 and 0 else, |
|          |                    | store to RA                                                      |
| stw      | RS, D(RA)          | store word from RS to address D+RA                               |
| stwu     | RS, D(RA)          | store word from RS to address D+RA and update RA                 |
|          |                    | to D+RA                                                          |
| sync     |                    | halts execution until the memory controller is finished          |
|          |                    | EDITTHIS                                                         |

caption and reference to PPC book and asm website correct the table

Obviously the mnemonics follow a certain pattern that has letters which can be interchanged to alter the meaning of the mnemonic, some of these characters are:

i indicates that the instructions uses an immediate value

2 Fundamentals and Applications of Computer Architectures and Compiler Design

b stands for byte and references the size of the operand

stands for halfword and references the size of the operand

stands for word and references the size of the operand

indicates that one of the operands is shifted

### g, ge, l, le, e

stand for greater, greater or equal, less, less or equal and equal which is the possible content of the conditional register

There are also special operands which might occur in asm which behave like pointers:

@1(C)

is equivalent to the lower order 16 bits of c in the symbol table

#### @ha(C)

is equivalent to the higher order 16 bits of  $\mathfrak c$  in the symbol table and minds the sign extension

Additionally there exist markers which are intended for debugging:

#### .loc # # #

marks a line of code (file, line, column) in the source file

.LVL

is a local label which can be discarded

.LFB

marks the begin of a function

.LFE

marks the end of a function

.LCO

is a constant of the literals table at position 0

#### all these tables in the appendix

We want to spend just a little more time on assembly as it is useful to know how to program in assembly while using C. This is done by the following scheme:

```
asm volatile ( "add %0, %1, %2"
: "=r" (dst)
: "r" (src1), "r" (src2):);
```

The line of code above tells the compiler to generate the instruction add in assembly which is followed by three operands. The number n in %n indicates that the operand is specified by the n+1th description of an operand that follows. The description that follows after: describes the output operands. "=r" means that the output is to be stored in a register (letter r for register operand) and that the register is to be written (= this is called constraint). The variable in parentheses must be declared before its occurrence and of matching type (float would not be allowed in this case). The following description is that of the input operands and those must not be written! r again stands for a register operand and the variable is in parentheses, the arguments are separated by commas. After the third: follow clobbered (=temporarily used) registers which would also be in quotes, but these are optional arguments. volatile means that the compiler must not delete the following instructions due to optimization.

As a special command asm (:::memory); would indicate a memory barrier to the compiler, ergo the machine instructions previous to this line and the one following may not be interchanged.

#### 2.3.3 Intrinsics

Something that will also occur quite often later in this thesis are intrinsics. Intrinsics are sometimes also called built-in functions and resemble an intermediate form of asm and a high-level programming language. This means that by calling an intrinsic function, we tell the compiler to use a certain machine instruction that typically shares its name with the intrinsic. What differs an intrinsic from asm() is that we do not need to specify constraints or registers classes but only need to keep an eye on the type of arguments. One could easily mistake them for normal functions of library but they are directly integrated into the back-end of a compiler and thus independent of the programming language. In order to implement intrinsics into a back-end the compiler need a certain knowledge of what the asm instruction does and what kind of operands it needs.

A typical field of application for intrinsics would be vectorization and parallelization of code through processor extensions. Sometimes this is the sole option of using the machine instructions associated with them.

#### 2.3.4 GNU Compiler Collection

The GNU Compiler Collection (GCC) is a compiler suite that supports different programming languages and targets. Though normally it is seen as a build of GCC supports a variety of front-ends while it was built for a specific target. This target in most cases is the processor architecture on which the user runs the compiler. But GCC also supports the idea of a cross-compiler which is the concept of compiling code on one machine but running the code on a different machine that is also based on a different architecture. One must though build a version of GCC locally for every back-end one wants to compile code for. This is realized through a modular structure which follows the idea of a front-end, middle-end and back-end as it was described in section 2.3 although some information that belongs to a back-end is also needed at the front-end, hence the compiler is built

back-end specific but supports a wide variety of back-ends to choose from.

GCC itself is programmed in C++ and part of the GNU project of the Free Software Foundation. It is wide-spread and one of the most popular compilers especially among academic institutions and small scale developers. Every major UNIX distribution and many minor ones include GCC as a standard compiler. von Hagen (2006) As an open source project there is a constant development to the compiler and there exist many threads that support known bugs.

There is one major competitor though which stands besides GCC as an open source compiler suite which is Clang that is part of the LLVM (low level virtual machine). Both support running the same source code on multiple machines while LLVM actually runs intermediate code rather than actual machine code and uses GCC to generate this intermediate code for some front-ends. However while one can argue in favor for either one, GCC seems a little better suited for our application. These results have to be viewed with care, as they are based on different processor architectures but it seems like both compilers provide similar performance. Ultimately it is the personal preference of the programmer that decides which compiler one is more comfortable with and often enough he chooses that compiler. In our case I have chosen GCC over LLVM for two main reasons. One is that after all GCC follows more the traditional concept of a compiler that generates machine code at the end and also I was far more familiar with GCC than with LLVM when this decision had to be made. The other is that GCC support existed to a minimum before I started this thesis and thus there was a point to start from. This topic will be referred to later on in the discussion but for now a short motivation seems to be sufficient.

By now GCC is a stable release version of 6.3 with version 7 in the works but we will use an older version which is used internally at the BrainScaleS project which is version 4.9.2. Additionally we will use binutils 2.25 which was patched by Simon Friedmann and since includes the opcodes and mnemonics which are supported by the nux. A complete specification of the libraries used and a handy script that builds a cross-compiler for the nux on PowerPC systems can be found there as well.

We take a special interest in the PowerPC back-end of GCC which is called rs/6000 for IBMs RISC system/6000 architecture that is equivalent to POWER. According to GCCs Internals manual GCC (2017a), which we will refer to as the sole source of information in this regard, the back-end of GCC has the following structure:

Each architecture has a directory with its respective name in gcc/config/rs6000 that contains a minimum amount of files. These are the machine description rs6000.md which is an overview of machine instructions with additional information to each instruction and the header files rs6000.h and rs6000-protos.h and source file rs6000.c that handle the target description through macros and functions. Every back-end needs these files in the GCC source and the final back-end is build from these files through the macros and functions just mentioned. To notice a back-end in the first place the back-ends directory -here "rs6000"- must be added to the file config.gcc which also includes a list of all files in the aforementioned directory. Most back-ends include additional files which makes a back-ends complex structure clearer but these are not mandatory and we will address these later.

reference LLVM vs. GCC on ARM, LLVM vs. Gcc in EISC Instead we address one of the most important functions which unfortunately is also one of the least documented though most complicated ones. The function/process is called "reload" and is used as part of the register allocation process. GCC (2013) Specifically reload is meant to do register spilling but over almost 25 years that GCC existed until 4.9.2 it became more and more complex and basically does everything associated with register allocation (mainly moving the contents of different registers and memory around, and finding the right registers in the first place). Over the years it thus became one of the main sources of errors when constructing a back-end and was meant to be replaced several times. As of now reload is being replaced by LRA (local register allocator) but GCC 4.9.2 is not impacted by this therefore we are stuck with reload indefinitely.

To address possible errors in reload later on we now get to know one form of IR in GCC that is Register Transfer Language (RTL).

#### Register Transfer Language Basics

RTL, which is not to be mixed up with Register Transfer Level, is a form of IR the backend uses to generate machine code. Usually GCC uses the IR GIMPLE which looks like stripped down C code with 3 argument expressions, temporary variables and goto control structures. The back-end transforms this into a less readable IR that inherits GIMPLEs structure but brings it to a machine instruction level. It is inspired by Lisp lists and thus we will need to take a look at those at last in before we take on the task of extending a GCC back-end.

We do so in looking at one of the most fundamental RTL statements first while explaining the each part at a time.

There exist manuals to basically everything which is written here and the more extensive manual will be referenced at he end of each paragraph.

define\_insn is an RTL expression that generates an RTL equivalent to a machine instruction. One such instruction is called an insn (short for instruction) hand has several properties like a name, an RTL template, a condition template, an output template and attributes. GCC (2017g) The name in this case is add<mode>3 (3 for three operands) where <mode> is to replaced by a set of values that describe a modes. GCC (2017p) A

mode is the form of an operand and can be something like si for single integer, qi for quarter integer (quarter the bits of a single integer), sf for single float or v16qi for a vector of 16 elements which are quarter integers each. GCC (2017k) There are many more modes that follow the same scheme. In this case we do not specify the mode explicitly but use an iterator that creates a define\_insn for every valid mode we specify. GCC (2017j) The second define\_insn shows this with a specific mode.

Next follows the RTL template which is in square brackets. All RTL templates need a side effect expression as a base which describes what happens to the operands that follow. In our case set means that the value which is specified by the second expression is stored into the place specified by the first expression. GCC (2017o) The first expression that follows is a specified operand. match\_operand tells the compiler that what follows is a new operand. VI2 belongs to the mode iterator we saw earlier and is to be substituted by the equivalent mode to <mode> in caps, which can be seen for the following define\_insn. All modes VI2 are to be substituted by the same real mode. After the mode comes the index of an operand which starts at 0 for every define\_insn. The following string describes a predicate which tells the compiler more about the operand and which constraints it must fulfill. Operands typically end in \_operand and a single predicate is meant to group several different operand types. In this case any register would be a valid operand. GCC (2017m) The next string specifies the operands further and is meant to fine tune the predicate. It is called a constraint and matches the description which was taken in section 2.3.2. = again means that the register must be writable and v stands for an AltiVec vector register.GCC (2017e) This pattern is repeated for every operand and only changes slightly. Though the second expression of the set side effect has an additional pair of parentheses because of the plus statement. This is an arithmetic expression and tells the compiler that the following operands are part of an operation that results in a new value. It is also succeeded by a mode that specifies the mode of the result.GCC (2017b)

The RTL template is matched by the compiler against the RTL it generated from GIMPLE and if the template matches the RTL is substituted by the output template that follows.

After the RTL template is finished, the condition specifies if the insn may be used. It is a C expression an must render true in order to allow the matching RTL pattern to be applied. In this case the condition is also depending on the mode iterator which substitutes <VI\_unit> for equivalent code to that of the next define\_insn with a matching mode. GCC (2017g)

The output template is usually is similar to the asm template from asm. The string contains the mnemonic of a machine instruction and the operands which are numbered according to the indexes of the RTL template. Again this is depending on the mode iterator and <VI\_char> will be substituted by a character that belongs to a machine mode. GCC (2017g)

At last the insn is completed by its attributes which hold further information about the insn that is used by the compiler internally like which effect an insn has on certain register etc.. We are less interested in this, as attributes are optional and we do not add attributes to the back-end. GCC (2017d) The attentive reader might have noticed that only RTL template is written in RTL. This is true but still do inso patterns belong into this section. The RTL not only is the most important part of an inso but we will hardly see RTL outside from RTL templates. Still should RTL be mentioned in its other form here as it is used for debugging purposes.

RTL can be split into two phases which are non-strict RTL and strict RTL. Non-strict occurs only before reload and is very deliberate in specifying its operands. Operands usually are virtual registers that have a unique number. match\_operand then is replaced by (reg:SI 1) which tells the compiler the type of operand, the mode and the register number.GCC (2017n)

Strict RTL has passed reload and no longer contains virtual registers but only references existing hard registers or memory.

An example of non-strict RTL and strict RTL of the same code can be seen in figure

examples of RTL code

MMU?= memory controller PPC must handle syncing in compiler when I/O is added explain stack and frame pointer PPU instruction set

## 3 Extending the GCC Back-End

Since we now are familiar with the basics of processors, the PPU, compilers and GCC we can put this knowledge to use and start extending the GCC back-end. There es a set of files we will systematically edit and keep referencing to as they are important parts of the rs6000 back-end and were changed in the process of extending the back-end.

rs6000.md this is the machine description of the back-end in general and contains the insn definitions for all scalar functions

rs6000.h is a header file which contains all macros and declaration of registers

rs6000.c is the source file which implements all functions that are needed for generation of the back-end and other accompanying functions

rs6000.opt lists the options and flags which can be set for the target

rs6000-builtins.def contains the definitions of built-ins/intrinsics

rs6000-cpus.def lists subtargets that belong to the rs6000 family

rs6000-c.c links built-ins to overloaded built-ins

rs6000-opts.h contains a set of enumerations that represent option values for the backend

rs6000-protos.h makes functions in rs6000.c globally available

rs6000-tables.opt lists string/int equivalent to an enumeration which lists CPU types

driver-rs6000.c mostly a collection of driver details for different targets

ppc-asm.h sets macros for the use of asm

**s2pp.md** is a new file that is the machine description of the nux' VE and contains inso definitions for the VE

s2pp.h is the header file that defines aliases for built-ins

constraints.md contains definitions of the constraints that are used

predicates.md contains definitions of the predicates that are used

vector.md declares inso definitions that apply to vector types in general

## sysv4.h initializes a variety of option flags an sets default values

## t-fprules sets soft-float as default for certain targets

Before we start it should be emphasized where inspiration was taken from in order to take the right steps when extending the back-end. There exists no specific documentation on the rs6000 back-end whatsoever and most information on the ways it works is available through comments in code only. There is however an existing vector extension which is included in the back-end and is quite similar to the vector extension we want to add. The existing vector extension is the AltiVec Vector extension, we described earlier in this thesis. The source code thus often shows strong similarities between AltiVec and s2pp functions. Still handling AltiVec and s2pp vectors is separated throughout the back-end since mixing these and reusing function can both get messy and dangerous since at times one vector extension must be distinguished from another. Also there do exist enough differences between these two VEs that combining functions and having separate ones would not make a big difference in the end.

We will start with adding the nux processor to the list of targets and also want to include mandatory flags with this. Ideally the user only has to add the option flag -mcpu =nux when compiling in order to produce machine code for the nux. the flags which have to be set when using the nux are:

#### -msdata=none

disables the use of a "small data section" which is like a data section but has a register constantly referring to it and thus has faster access than the normal data section. Globals, statics and small variables that are often used are preferably stored there.

### -mstrict-align

aligns all variable in memory which means that a variable always starts at a memory address without offset. Every variable requests at least 1 byte of memory when strictly aligned

### -msoft-float

tells the compiler that there is no FPU and all floating point operations have to simulated by software.

### -mno-relocatable

states that the program code has a fixed memory address that may not be altered.

To do so we first create the flag -ms2pp along with an option mask. This should activate the vector extension and everything associated with it. In rs6000.opt and we simply need to add the lines:

```
ms2pp
Target Report Mask(S2PP) Var(rs6000_isa_flags)
Use s2pp instructions
```

ms2pp is the name of the target flag and the next lines defines which macros shall be defined. Target means that the option is target specific, Report means that the option is to be printed when -fverbose-asm is activated. Mask(S2PP) initializes a bitmask that is available through OPTION\_MASK\_S2PP which is attached to rs6000\_isa\_flags specified by Var and in parallel specifies a macro TARGET\_S2PP that is set to 1. GCC (20171) When we are finished with this we need to specify #define MASK\_S2PP OPTION\_MASK\_S2PP which is done in rs6000.h as MASK\_ is the standard.

Now we add the processor type which uses this flag as a standard. There exist several lists that contain available targets and we need to add the nux to these. First we add the asm flag that tells the assembler which system architecture is used. As the nux is based on Power7 and the VE does not influence asm for the GPP we want nux to behave like Power7. To do so we add %{mcpu=nux: %(asm\_cpu\_power7)} to ASM\_CPU\_SPEC and simultaneously add { "nux", "%(asm\_cpu\_power7)" } to static const struct asm\_name asm\_names[] as these arrays must be kept identical. This will induce the assembler flag to be set to -mpower7.

Now we want to achieve something similar for the preceding phases of the back-end as as the description of flags above is mandatory to running nux. First we will create an entry in rs6000-cpus.def which sets the flags accordingly:

```
RS6000_CPU ("nux", PROCESSOR_POWER7, MASK_SOFT_FLOAT | MASK_S2PP | MASK_STRICT_ALIGN | !MASK_RELOCATABLE)
```

will define the processor as a Power7 type and add the masks for soft-float, strict-align and no-relocatable(negative of relocatable) as well as the new s2pp mask. Setting a mask is the low-level equivalent to adding a flag as every flag induces a mask, which we saw earlier, and thus this will fit our cause. Calling RS6000\_CPU will basically generate an entry to the enumeration rs6000\_cpu\_opt\_value which must be linked to a string following -mcpu=. This is done in rs6000-tables.opt where a string, in our case nux, is associated with the n-1-th call of RS6000\_CPU through:

```
EnumValue
Enum(rs6000_cpu_opt_value) String(nux) Value(n)
```

One must basically count the calls of RS6000\_CPU to get n.

As we are still missing the -msdata=none flag, we will take care of this now. Because -msdata is not an ordinary target flag, we need to initialize it differently. Depending on the string following -msdata= the back-end decides which mask is referred to when calling rs6000\_sdata. This is decided in sysv4.h through comparing with different valid strings. Since we were only able to set option masks before we need a little work-around here that helps us setting the value of rs6000\_sdata. If -msdata was not defined as it is the case with -mcpu=nux the compiler will check through else ifs what to do for rs6000\_sdata. We simply add one such check in front of all other else ifs and check for all the target flags that -mcpu=nux sets. Hence the target options will also set rs6000\_sdata as we wish which is to SDATA\_NONE. Although there exists a case for which this condition applies even if nux is not set as target and that is when all target flags are set by hand. Still if one chooses an explicit value for -msdata the option does

not apply and the explicit value is taken into account. This is somewhat not the ideal way of solving this problem but fulfills its purpose with as few side effects as possible.

Already this would allow us to use -mcpu=nux as target flag and -ms2pp as option flag. But since the flags we used are mandatory to the s2pp extension we also want to check these flags before starting compilation. First though we must create macros for each flag which the back-end notices. This is done in rs6000-c.c in function rs6000\_target\_modify\_macros where we can define macros depending on target flags.

```
if ((flags & OPTION_MASK_S2PP) != 0)
  rs6000_define_or_undefine_macro (define_p, "__S2PP__");
```

If flags and the respective option mask are set, rs6000\_define\_or\_undefine\_macro will define a macro which is the second argument if a macro is defined or undefined depends on the boolean define\_p but the back-end takes care of this on its own. We do this for all flags we are interested in. Now we will use these new macros to check if the flags are set. Before we do this though we create a new file that is s2pp,h and index this in gcc/config.gcc under extra\_headers at powerpc\*-\*-\*. We do this so GCC actually invokes the header file as it is not referenced elsewhere. After we added the usual conditional header macro to the file we add the first lines of code which are:

```
#if !defined(__S2PP__)
#error Use the "-ms2pp" flag to enable s2pp support
#endif
```

Hence if \_\_S2PP\_\_ was not defined because the target flag was not set, the compiler will emit an error that tells the user to set the target flag. This is done for all macros that we set above so to prohibit false use of nux. But we will also put in a second measure to prevent false use of floats by adding nux to the list of soft-float CPUs in t-fprules.

do more often like this: one example for repetitive use

Since the preliminary requirements for the activation of the VE are now met, we start adding the remaining fundamental macros and also add a vector attribute that the user may use later. First we add a new Vector type which we will need from time to time. We call it VECTOR\_S2PP and add it to the enumeration rs6000\_vector in rs6000-opts.h. To put this to use we will create macros in rs6000.h which test vector modes if they correspond to the available modes in s2pp. Hence we add many macros of this form which take and argument and compare it to a valid s2pp type/value.

checks the mode's associated vector unit and compares this to VECTOR\_S2PP. We also add checking for specific vector modes which shall be chosen carefully. The hardware only supports two types of vectors which are vectors with byte elements and vectors with halfword elements. Thus the respective modes are V16QImode and V8HImode which can be checked by S2PP\_VECTOR\_MODE(MODE).

We also need to find conditions which should apply to TARGET\_S2PP as they already to for TARGET\_ALTIVEC. Luckily there exist only two options which prohibit misalignment of vectors and control the right alignment. Besides that only a few functions which we will discuss later need TARGET\_S2PP to be called. Other options such a permutation of vectors

are not supported by s2pp or need a different implementation all together. The same procedure must be applied for checking vector modes by checking using VECTOR\_UNIT\_S2PP\_P and similar macros. These are mainly used in rs6000.c when addressing vectors in memory through registers as operands. Similar but more complex implementations will build the rest of this chapter and therefore we will not go into detail for every macro like

```
#define UNITS_PER_S2PP_WORD 16
```

that is used.

Now that we have added the s2pp vector we identify it with the available modes which is done in rs6000.c. In rs6000\_init\_hard\_regno\_mode\_ok we check for the s2pp target flag and assign VECTOR\_S2PP to the modes V16QImode and V8HImode. Also we add the modes for preferred modes when vectorizing a value in rs6000\_preferred\_simd\_mode and decline the usage of offset addressing for these modes in reg\_offset\_addressing\_ok\_p.

We will now define an attribute to use these vector modes. Thankfully GCC already supports a vector attribute which is also used by AltiVec. Thus we can add s2pp to the rs6000\_attribute\_table and rs6000\_opt\_mask[] array with the same values as for AltiVec but changing the keyword. This forces us to add the function rs6000\_handle\_s2pp\_attribute which is the same as the AltiVec version but stripped off some vector modes which are not supported. We could use those attributes already but they would come in quite unhandy thus we will define an alias to represent the new attribute in rs6000-c.c:

```
builtin_define ("__vector=__attribute__((s2pp(vector__)))");
```

This is the same for AltiVec and copies the usage of vectors in AltiVec. As does our addition of an s2pp case in rs6000\_attribute\_takes\_identifier\_p to complete the attribute.

## 3.1 Registers

Now we start adding the registers of s2pp to the back-end. There are three types of registers we need to add:

- **32 vector registers** these are normal registers that hold vector values
- 1 accumulator which is used for chaining arithmetic instructions and cannot be accessed directly
- 1 conditional register which holds conditional bits and also cannot be accessed directly

Registers are declared in rs6000.md thus we will do so as well.

```
(S2PP_COND_REGNO 32)
(FIRST_S2PP_REGNO 33)
(LAST_S2PP_REGNO 63)
(S2PP_ACC_REGNO 64)
```

necessary

check this refernce

explain this later

is added to the Definition of constants at the beginning of rs6000.md, where we give a range for the vector registers and after which we pick specific register numbers for the special vector registers. The number of these special vector registers does not matter though as we only need a number which no other registers that is used by nux refers to. There is a reason why we put the conditional register at 32 and only provide 31 vector registers: As the GPRs need the first 32 registers numbers (0-31) and there is never an FPU in nux, we want to use the 32 registers normally reserved to FPRs. However during the first tests of nux with its vector extension it became apparent that a reserved register with all bits set to 0 would come in handy at times. For one reason nux' instruction set does not include logical instructions especially no exclusive or (xor) function which is the standard way of having a registers set to all os (since xoring a value/register with the same value/register will always return 0) and also it does not provide an or function which is the standard way of moving registers around as it was shown in section 2.3.2. This led to the problem of "nulling" a register at times it was needed. Therefore the first register was left out and filled with zeros in order to be able to simply move/copy this register to a different one that needed to be nulled. Moving around registers was also realized differently than for normal architectures as fxvselect was used instead of or where the first operand is the destination while second and third operand are the source. The last operand, which is the conditional one, is set to 0 or left out as this substitutes the contents of the first operand always with that of the second operand. Now as this renders a simple work-around, substituting xor is more difficult and although there are ways of utilizing fxvselect to copy its functionality which we will care about later on. Particularly as fxyselect is the fast vector instruction as it takes only one cycle to complete and all other arithmetic instruction most likely take significantly more cycles. For this reason nulling a register by subtracting it by itself and storing the result in the same register would take more time hence would cost performance. Therefore the trade-off of having one less register at hand instead of wasting clock cycles continuously is adequate. Therefore since we do not want to reference the first vector register and cannot access the condition register by reference we simply reuse the register number for this cause. Unfortunately we do not have a second dispensable register of that kind and thus need to find a different index for the accumulator - though the same non-availability as for the conditional register applies. Luckily the POWER architecture does not reference the register index 64 for some reason so we can use that index without causing any conflict.

Now since we declared the vector registers' indexes we will create fitting macros as well. We need to decide on which registers shall serve which purpose. In this regard we shall also emphasize a little on fixed, call-used and saved registers. Fixed registers serve purpose that does not allow them to hold any other values at any time at all. Call-used registers are registers that are also not available for general register allocation as they are used for function calls and returning values. Only saved registers can be used for general register allocation as these may hold values over function calls such as variables that are often used throughout the program. This is done through macros which decide on the first saved register (in our case the 20th vector register) and list registers available for function arguments (a set of S2PP\_ARG\_ macros). Also FUNCTION\_...\_REGNO\_P(N) tells us if register N is used for function arguments in general. This number of course can be

optimized depending on the applications for the nux and also the style of programming but we keep the same number of registers saved as AltiVec and the FPU do.

Of course we need to go through the source files to find any use of those macros and add the new s2pp macros. We will not mention this every time as this is a standard procedure.

Now we need to get even more specific on the way the registers are organized. First we add S2PP\_REGS, S2PP\_C\_REG, and S2PP\_ACC\_REG to the enumeration reg\_class and define the identical names in REG\_CLASS\_NAMES. Then we need to specify the contents of our new register classes in REG\_CLASS\_CONTENTS.

```
/* S2PP_REGS. */
{ 0x00000000, 0xfffffffe, 0x00000000, 0x00000000 }, \
/* S2PP_C_REG. */
{ 0x00000000, 0x00000001, 0x00000000, 0x00000000 }, \
/* FLOAT_REGS. */
{ 0x00000000, 0xfffffffff, 0x00000000, 0x00000000 }, \
/* S2PP_ACC_REG. */
{ 0x00000000, 0x00000000, 0x00000000 }, \
/* S2PP_ACC_REG. */
```

Each hexnumber in these arrays can be viewed as a bit mask where the least significant bit is the first register, next higher order bit the second register and so on. Each number is 32-bit and therefore equivalent to 32 registers. The following numbers start where the previous one ended, therefore is the second number the bit mask for registers 32 through 63 (32 is the 33rd register). Therefore does <code>0xffffffffe</code> mask all register except for number 32 which is covered by <code>0x00000001</code> and 64 is masked the same way. One can see that the FPRs are masked completely as <code>FLOAT\_REGS</code>. The reason for this order is that subsequent entries must not me subsets of previous masks but may extend these. Since we also included a register which was not used at all before we also have to change subsequent masks accordingly.

Now that we defined register classes we again add macros to compare these as we did before.

Next we need to pick names for those new registers that we just defined. These were chosen according to the constraints we will define next which are kv for normal vector registers, kc for the conditional register, ka for the accumulator. Defining (in this case alternative) names for registers is also part of a macro ADDTIONAL\_REGISTER\_NAMES with elements that look like:

```
{"kc", 32}, {"kv0", 33}, {"kv1", 34}, {"kv2", 35}, \
...
```

The strings are the names for the registers and the integers are their indexes.

Now we also want to add the constraints which we just proposed. we therefore go to constraints.md and add the following code for kv, kc and ka:

```
(define_register_constraint "kv" "rs6000_constraints[RS6000_CONSTRAINT_kv
]"
```

```
"s2pp vector register")
```

where the first string is the register constraint's name and rs6000\_constraints[RS6000\_CONSTRAINT\_kv] will be assigned a register name later on in rs6000.c,where it is also declared, which in this case is S2PP\_REGS though this is only done if the target flag -ms2pp was set. The last string is only for documentary purposes.GCC (2017e) We chose k as the first initial of s2pp registers because there are very few letters left which are not used as a constraint already and k is one of them which is at least somewhat associated with the processor ("nuks"). The second character was chosen from the initial of the respective register type.

The next step which is necessary to make the registers widely accessible is adding them to rs6000.c. For once we add the new registers to rs6000\_debug\_reg\_global in order for them to be listed when printing out debug information. This is done the same way AltiVec registers are implemented and fairly easy. The same applies for pretty much all debugging support (THIS IS NOT gdb) therefore we do not mention additions to debugging, but in general these were done. It is far more important to add the registers to rs6000\_init\_hard\_regno\_mode\_ok which initializes global tables that are based on register size. Basically every index in rs6000\_regno\_regclass[] is given a register class which corresponds to the register of that index and afterwards each register class is assigned a register type in reg\_class\_to\_reg\_type[].

```
for (r = 32+1; r < 64; ++r)
    rs6000_regno_regclass[r] = S2PP_REGS;
...
rs6000_regno_regclass[S2PP_COND_REGNO] = S2PP_C_REG;
rs6000_regno_regclass[S2PP_ACC_REGNO] = S2PP_ACC_REG;
reg_class_to_reg_type[(int)S2PP_REGS] = S2PP_REG_TYPE;
...
}</pre>
```

This is done only when TARGET\_S2PP is true and substitutes the FLOAT\_REGS at indexes 33 through 63 with S2PP\_REGS and also adds the conditional and accumulator registers. To complete the process the register classes are assigned a respective register type which was added to the enumeration rs6000\_reg\_type before that. It is important to note that the s2pp normal register type was also added to the list of standard register types (again as a macro), as s2pp registers are meant to act like normal registers and also to the floating point/vector register types which is necessary as they are handled differently later in the source file.

Register can also be fixed () according to compiler flags which is done in rs6000\_conditional\_register\_usage. E.g. when the flag for soft-float is set, the compiler calls all FPRs fixed (ergo not available), which is contrary to our plan of using these registers. That is the reason why we add the alternative condition that the s2pp flag may not be set for this to happen. Instead we fix the registers 32 and 64 which are the used otherwise (kc and ka).

Now that we fulfilled all measures of adding the registers to rs6000.c we need to add them to the list of possible asm operands in ppx-asm.c where a simple definition like

previous section

#define k0 1 is enough if this is done for each register name. This tells the compiler to substitute k0 for 1 as only pure integers are valid machine operands.

The last step to completing register usage for the back-end is adding fitting predicates in predicates.md. We can copy these from the respective AltiVec predicates and change both the vector specific macros and the predicates' names. The affected predicates are:

### s2pp\_register\_operand

is a vector that is stored in a s2pp register

#### easy\_vector\_constant

returns 1 if the operand is a constant vector and now also tests for s2pp vectors and calls a new function easy\_s2pp\_constant

#### indexed\_or\_indirect\_operand

applies for indexed or indirect operands (e.g. memory addresses in registers) and now supports the case of a s2pp vector operand as for AltiVec vectors

### s2pp\_indexed\_or\_indirect\_operand

is the same but only for s2pp vectors, not AltiVec

If one wants to add own predicates, GCC offers a manuals entry GCC (2017m).

The easy\_s2pp\_constant function checks if an operand is "splittable" which means that all elements are the same and thus can be generated through a split instruction. To check this the operand is analyzed sequentially if either of the two available splats can synthesize the same operand. This was also transferable from an AltiVec equivalent with the exception that one vector mode had to be removed. prologue epilogue Additionally the similar function gen\_easy\_s2pp\_constant could also be transfered as it works the same way but generates RTL code that will create a constant vector operand from a different operand.

Now as we have completely added the new registers, we must take care of some consequences this caused. Mainly our problem is that we reused the floating point registers and did not mark them as fixed. For this reason we must scan the source files for <code>FP\_REGNO\_P(N)</code> and add an exception with && !TARGET\_S2PP when it is necessary. This is especially the case when dealing with hard registers and having the compiler emit register moves.

## 3.2 Prologue and Epilogue

Also we have a second problem that reusing FPR indexes causes. Since there is an ending number of registers the compiler can use for saving values, and this number is also reduced by the amount of fixed registers, the compiler occasionally stores values in memory before calling a function that needs some of the available registers by calling a so called "prologue" (GCC, 2017i). The compiler then restores the registers after the function has finished through an "epilogue". To understand the principles of these we will take a look at how the back-end saves registers. Before the compiler saves the registers it

earlier

must check which registers need to be saved. It then sets a variable for each register type accordingly. In rs6000\_emit\_prologue the compiler checks for these variables and starts saving the respective registers. Since the compilers looks for register numbers instead of types, we need to alter the case for FPRs and add a target flag, which activates these lines of code:

The compiler first loops over every save vector register and checks if the register needs to be stored. Then an offset is computed that takes the current frame offset and adds 16 bytes(this is the register size) for every register of this type. savereg and areg are two kinds of register that will later serve as operands. savereg is the vector register we want to save and areg will we the GPR operand needed to use vector memory instructionsHence areg loads the offset just calculated through emit\_move\_insn. emit\_move\_insn will then insert an insn to the IR that moves the second argument to the first. Now we can create a complete memory operand mem that takes offset in areg and frame\_reg\_rtx, which is the register holding the frame pointer to wheres all registers are saved, and combines them to a single operand. As both register and memory are specified the compiler can emit a move insn that stores savereg to mem. Finally rs6000\_frame\_related handles the insn which was just created and performs additional customizations which are needed as the moves belong to a function call. After the functions code has been compiled the back-end invokes rs6000\_emit\_epilogue which uses the same functions as the prologue but restores the registers from memory. Of course this function can not work on its but needs other functions that set parameters and prepare statements as well. These functions were also extended by adding conditional cases for s2pp that mostly agree with the respective AltiVec cases. An rtx, which was a type often used in the previous listing, is short for RTL expression and can be used by insns as an argument. RTL expressions can also contain information on how the operand is to be constructed as this allows chaining of operations.

## 3.3 Reload

As hinted in chapter 2 reload is a now deprecated process in GCC that mainly performs register allocation. Obviously we need to add special handling of vector registers to

fxvstax and fxvlax

reload because the registers we just added to the back-end must be used somehow. We thus get back to S2PP\_REGS which were the normal vector register form earlier. As reload is capable of moving the contents of registers, we must specify that s2pp's registers are not compatible with GPRs or any other registers and also that storing/loading registers takes two GPRs that hold the memory address. Because registers are quite different in their specifications and reload could possibly ask for any combination of source/destination register we declare only registers moves between s2pp registers only or between s2pp registers and memory valid. This also creates the need for checking the register class of an RTL expression and eventually correcting the register class. Additionally as reload does not support the addressing mode which AltiVec and s2pp both use, we must rearrange indirect addresses the same way AltiVec does. At last we also need to validate the mode which is used. All of this is done in a set of rs6000\_secondary\_reload\_functions which need to be modified.

## 3.4 Built-ins and Machine Instructions

Basically the back-end is now almost compatible with s2pp vector instructions. The only thing that is left is specifying the machine instructions which move registers or access memory. We therefore add a machine description file s2pp.md to the back-end which will contain all available vector instructions. On which is \*s2pp\_mov<mode>:

```
(define_insn "*s2pp_mov <mode>"
  [(set (match_operand:FXVI 0 "nonimmediate_operand" "=Z,kv,kv,*Y,*r,*r,
     kv,kv")
   (match_operand:FXVI 1 "input_operand" "kv,Z,kv,r,Y,r,j,W"))]
  "VECTOR_MEM_S2PP_P (<MODE>mode)
  && (register_operand (operands[0], <MODE>mode)
   || register_operand (operands[1], <MODE>mode))"
   switch (which_alternative)
    case 0: return "fxvstax %1,%y0";
    case 1: return "fxvlax %0,%y1";
    case 2: return "fxvsel %0, %1, %1";
    case 3: return "#";
    case 4: return "#";
    case 5: return "#";
    case 6: return "fxvsel %0,0,0";
    case 7: return output_vec_const_move (operands);
    default: gcc_unreachable ();
  }
  [(set_attr "type" "vecstore, vecload, vecsimple, store, load, *, vecsimple, *"
     )1)
```

We explained the basics of insn definition earlier in section ?? thus we will only explain this shortly. The name is proceeded by an asterisk that renders the name not accessible directly because this insn shall only be referred to by the RTL sequence that follows. The sequence is fairly simple and states that operand o is set by operand 1. It only gets interesting when taking a look at the constraints which build pairs for each operand and are separated by commas. The first pair for example (=z and kv) tells the compiler that a memory which is accessed by an indirect operand will be set by the contents of a vector register (this is the first use of our new constraint. Which machine instruction is used for each pair of constraints is stated in the output template which may also be in C form. Each case is indexed according to the position in the list of constraints so we take a look at case 0 where we see the matching machine instruction for storing a vector in memory fxvstax. The second operand contains also a character besides its number which is an operand modifier GCC (2017c) that will cause the operand 0 to be split into two address operands. Besides the vector load instruction fxvlax we also note the fxvsel instruction which is used to move one vector register to another(pair kv and kv) and case 6 which moves the contents of vector register 0 (this register is all 0s) and thereby nulls that register. The constraint j is the respective constraint for a zero vector.

But still there exist instruction which are not specified in detail but only by "#". A # is the equivalent to stating that there is no instruction which can perform the instruction which is described by the RTL above. This causes the compiler to look for different RTL that has the same effect but also has a machine instruction.

This process is called "insn splitting" and can also be used for optimization. A developer may also define which RTL statements are equivalent by using define\_split which is documented in GCC (2017h). As there exist splits for cases 3 through 5 in AltiVec, we may not define splits ourself.

Splitting insns is quite common for the GCC back-end because splitting insns with unspecific constraints allows for easier generation of code, since the back-end will search for code that fits the operands of the instruction which is called in the source files. Even s2pp\_mov is mainly used by calling a general vector move insn in vector.md that has the same RTL code without any constraints and then is split. We only need to add the s2pp vector mode macro to the condition of the insn which is specified through define\_expand "mov<mode>. Expanding an insn which is done for this code basically follows the principle of splitting and goes even further in not giving the option for a machine instruction at all but relies on matching RTL in some machine description of the back-end. Detailed information can be acquired in GCC (2017f).

As this is done for register moves is general we will do the same for specific loads and stores, by adding mode checking in vector.md and adding a matching insn to s2pp.md.

Beside adding the other insns we also need to add the function rs6000\_address\_for\_s2pp which converts a standard address to an s2pp compatible form. This code could be taken from an AltiVec equivalent since both vector units share the same form of memory referencing.

Now before we continue to other insns, we shall go back to s2pp\_mov and take a look at case 7 which applies for constant vectors. Constant vectors which have a constant value that does not depend on a register thus is known at compile time. GCC will look for such vectors and check if these vectors can be splatted (this means all vector elements have the same value). This is a relict of AltiVec which offers a special splat instruction that takes a constant immediate value as operand. We kept the function since we can achieve a similar effect for constant zero vectors by moving the zero register. To distinguish

between cases where this applies, we use the function output\_vec\_const\_move() which is also used by AltiVec and implemented similar code to the existing AltiVec code but reduced this to the null vector case and emit a # for splitting otherwise.

Since we are now missing an instruction to use this split we will create a split ourselves which converts the immediate splat into a normal splat:

```
(define_split
  [(set (match_operand:FXVI 0 "s2pp_register_operand" "")
    (match_operand:FXVI 1 "easy_vector_constant" ""))]
  "TARGET_S2PP && can_create_pseudo_p()"
  [(set (match_dup 2) (match_dup 3))
    (set (match_dup 0) (unspec:FXVI [(match_dup 2)] UNSPEC_FXVSPLAT))]
  "{
    operands[2] = gen_reg_rtx (SImode);
    operands[3] = CONST_VECTOR_ELT(operands[1], 1);
    }")
```

We split the upper RTL sequence which moves a constant vector to a vector register onto the bottom sequence which inserts an intermediate step. match\_dup n means that the operand should match the operand with the same index. The C code that follows specifies the newly added operands 2 and 3 further and converts operand 1 into a single integer element because all values are the same. The second RTL sequence uses the newly created integer and moves it to a GPR which then is splatted into a vector register. An unspec operator together with UNSPEC\_... tells the compiler that the operation is not specified but has a name to distinguish is from other unspecified operations.

is the insn the split refers to.

After we finally added memory instructions we can also implement them in rs6000.c. In rs6000\_init\_hard\_regno\_mode\_ok we assign the code for store and load instructions to the two supported modes in case the target flag is set but do not specify any others.

All that is left now are intrinsic functions that make use of the memory instructions. Although by now the compiler would already support asm usage as shown in 2.3.2 and which we will get back to later on. Still we do not want to stop there and also add intrinsic functions to the back-end. This is done through several steps that were also described in an internship report *Heimbrecht* (2017). Therefore we will only describe this briefly and refer to the report at times.

First we start defining insns for each vector instruction that is listed in *Friedmann* (2016). In order to allow access to the synram we implement insns that work similar to fxvstax and fxvlax and are called fxvoutx and fxvinx. All of these insns exist with different conditionals and are named accordingly because load and store insns are hard to implement with an additional argument that is the conditional. Simple arithmetic

instructions exist in multiple versions that either support conditional execution or do not. This is due to a problem that was discovered only recently with the PPU. As tests revealed the conditional execution of arithmetic instructions works improperly and in case a condition does not apply the result of a previous instruction is written to the first operand. Normally the operation should leave the contents of the operand untouched instead. For this reason we implement a workaround through insu splits that utilizes fxvselect and its conditional execution as this does work as intended. Though we still offer arithmetic operations without splits as this saves a clock cycle in comparison to having an additional fxvselect instruction. Besides those simple arithmetic operations, fxvadd..., fxvsub... and fxvmul..., there also exist more complex operations that make use of the accumulator. As of now these instructions should not be used with conditionals as further testing is imminent and it is not clear whether those conditionals would work. Besides that does an extra instruction render the advantages of an accumulator meaningless. Nonetheless do we implement those accumulator instructions together with a conditional argument which would allow for easier testing later on. Also does the additional operand not influence the performance of the instruction in a bad manner since for the conditional o the instruction works the same as if the conditional was true for all elements.

We will refer back to this when discussing the results of extending the back-end.

Since this completes the insns we can go on and create built-ins in rs6000-builtins.def from these insns. First though we must add macros that makes adding intrinsics easier. The exact definition of those macros is described for AltiVec built-ins in Heimbrecht (2017) but we can easily transfer those for s2pp by adding the RS6000\_BTM\_S2PP built-in mask which is identical with the target mask. We then use the s2pp built-in macros to create built-in definitions for each insn and also each mode (halfword or byte). Most built-ins follow the scheme of a normal function that has a result and a certain number of arguments. Btu there exists a number of insns that do not produce an output as they set the accumulator or the conditional register. These instructions need special handling and thus are defined as special built-ins.

Besides defining built-ins we also define overloads. These are used to differ intrinsics through the types of their arguments later on what makes using them easier and stops false usage. Overloads are also further explained in *Heimbrecht* (2017) As we are adding overloads we can put them to use in rs6000-c.c where overloads and built-ins are connected. This is done through structures that combine the built-in and overload names with a return type and up to three arguments.

To actually make use of these structures, we will add a function to rs6000-c.c that resolves the overloaded built-ins and is again built upon an AltiVec function that does the very same. Therefore we must tell the back-end which functions to use by making a target dependent decision in rs6000.h which overload-resolving function is handling overloads in general. Also we need a functions in rs6000.c that handle built-ins in general. First we create a new function s2pp\_expand\_builtin which is invoked by rs6000\_expand\_builtin. This function handles all special built-ins that belong to s2pp and picks expander functions according to their name. We therefore must create those expander functions that take care of certain kinds of built-ins. One such group are

memory intrinsics that handle explicit memory addressing and also apply to synram intrinsics. These expanders are needed because of the special way that memory is accessed through indirect register referencing. As for the operand modifiers we can use the same implementation that AltiVec uses but create a new function in case improvements may become necessary at some point in time.

Besides these expander functions we additionally create a new kind of expander function that did not exist before s2pp. As there is a great number of instructions that do not return any values because they write the accumulator or the conditional registers, we create expander functions for various numbers of arguments that each do not handle a return operand. These functions which were called s2pp\_expand\_unaryx\_bultin, s2pp\_expand\_binaryx\_bultin and s2pp\_expand\_ternaryx\_bultin otherwise behave quite similar.

Now that we are already handling special built-ins we need to define those built-ins in rs6000.c as well. s2pp\_init\_bultins takes care of that as it is a series of define\_builtin functions which must be written explicitly in contrast to normal built-in functions, where this is done automatically.

We did not mention so far the special case of three other built-in function that belong to the group of special built-ins as well but differ in how the RTL code/machine instructions are generated. vec\_ext, vec\_init and vec\_promote describe functions which do not belong to a specific vector instruction but are compiler-constructed sequences of machine instructions that correspond to a certain action on a vector which is mainly performed in memory. These functions are inspired by AltiVec's implementation which easily was transferable to s2pp.

Finally we conclude on built-ins by defining alternative names for built-in functions in s2pp.h.

A complete list of all intrinsic functions that the compiler supports at the time this thesis was written is available in the appendix ??.

? target pragmas compiler anweisung stack boundary -> put this together frame offsetr stackoffset fix pck instructions

## 4 Results and Applications

Now that the GCC back-end supports s2pp, we want to examine features and early applications of this.

So far the compiler back-end can be built into a working compiler which creates working machine files for the PPU. This newly built compiler also supports the use of -mcpu=nux and -ms2pp target flags and a header file named s2pp.h which can be included the same way GCC standard header files are included. Target flags and header file together will this allow for using the vector attribute which creates vector variables of different types.

These vector variables can serve as arguments for implemented s2pp intrinsic functions that cover every vector instruction which is available on nux. Additionally the new intrinsics support type detection and map automatically to the according machine instruction for each vector type if this is demanded by the user.

When using this, assigning registers as well as memory is done automatically and does not need for user interaction. Despite the different bus, the back-end can also accesses the synapse array through special intrinsics. Specifically the intrinsics fxv\_inx and fxv\_outx allow to access the synapse array and load/store variables from/to there.

As many of the mentioned intrinsics as possible were designed similar to existing intrinsics for AltiVec and use the vec\_ prefix besides th fxv\_ prefix. This was done to include both, users that are accompanied to the existing vector macros and new users that are somewhat familiar with AltiVec.

In Addition to intrinsics the compiler also supports coding in asm with vector instructions. In contrast to before the user does not need to choose hard registers or implement store and load instructions, as this is done by the compiler for asm as well. This is possible through the addition of the kv constraint that marks vector registers as r does for GRPs. Overall asm for nux became more intuitive than it had been before.

This will ultimately make it easier to combine low-level coding in high-level programs if this is ever needed.

The new back-end also supports the use of global functions that support simple function calls.

First tests which conducted during extension development, made use of intrinsic over macros and produced working machine code as well as correct results for small examples. More complex tests will be discussed in 5 David Stöckel further implemented a small series of tests which used a newly implemented unit testing framework (libnux) along intrinsics for nux in order to conduct high-level software tests. Through these early test, he was able to find a bug which previously was not known and could be identified as undocumented behavior of the nux for conditional execution of arithmetic instructions. In the end it was possible to implement a workaround for this which was already mentioned

in ?? that has minimal influence on performance and code size.

He also used the nux back-end for first experiments that made use of different functionalities of nux. One experiment had the PPU automatically increasing the synaptic weights of all synapses up to a maximum value and then decreasing the weights the same way. Another experiment updated all synaptic weights depending on spike counts to create homeostasis and yet a different experiment implemented simple STDP that relies on accessing the hardware correlation of HICANN-DLS. As all these experiments still used asm instead of intrinsics we have yet to encourage the use of intrinsics over asm in the future. Nonetheless did the nux back-end simplify usage of asm as described earlier and allowed the user to focus on tests rather than low-level operand management.

All tests used a version of GCC that was patched and then integrated to the waf build system of the group. Hence a patched cross-compiler is already commonly available. As pointed out in the beginning of this thesis we also wanted to support optimization of vector specific machine code. As for now this could be achieved for simple optimization (-01) which mainly reduces memory access to a minimum but keeps execution order. This still gives readable assembly code and should achieve similar performance to code written with the former standard macros.

This is due to the way these macros were implemented. Since the compiler did not recognize s2pp instructions before, asm statements had to be volatile to prevent vector instructions to be removed by optimization as volatile statements are protected from any optimization. Thus only code around vector instruction was affected by optimization and performance of vector instructions was up to the user's skill. Optimization going beyond (-01) is yet to be tested for reliability with the new back-end which will be discussed later on.

Besides these rather internal improvements to s2pp usage we want to point out the main advantage of the new back-end for users of nux. The main goal of this thesis was simplify programming for nux and listing ?? shows this for an exemplary program. We compare an existing test program that showcases the abilities of nux to an equivalent program that uses intrinsics instead of C macros.

write an analysis of this

supports functions

## 5 Discussion and Outlook

To summarize this thesis we want to go back to the initial motivation for this work. The future of analogue neuromorphic systems shows many possibilities that utilize the advantage over exiting simulation environments.

Extensive experiments that would exceed the boundaries of software based simulations regarding experimental time and simulated time are possible and as hardware improves their complexity will do the same. At the same time these analogue systems offer easy manipulation of different parameters during experiments without a decrease in performance. This encourages users to create new and daring test scenarios that would not be realizable for software simulations or ineffective. Even more experimental approaches involve interfaces to virtual environments and running hardware "in-the-loop".

A key feature of neuromorphic systems could be longterm experiments at high speeds that could also run on their own and apply plasticity at experimental speeds. This would give an immense amount of possible applications for the future but also generates new requirements to achieve this.

The PPU may be a key element on the way there as experiments that so not utilize the PPU usually run slower and are less flexible than PPU based experiments. This is due to the performance of vector processing that the PPU offers because of its customizations and the immediate access to the analogue system as both are combined in the HICANN-DLS. As of now the PPU offers fast calculation of simple plasticity rules and thus flexibility when compared to static systems that have predefined setting that are only updated in between experimental runs. Although the PPU can do this for only so much complexity as the hardware is quite limited, it fulfills he requirements for simple optimization, implementing reduced virtual environments that need high latency and managing calibration of the system. It also gives the possibility of running experiments without supervision as the PPU could collect data as well.

All of this needs code that is at the same efficient and favorably of small size. At the same time software should be easy to write in order to appeal for many users and raise interest for these systems. One way to get all of this is working compiler support where the compiler translates simple code into optimized machine code that is specifically created for this target. Creating a stand-alone compiler that supports the PPU's custom architecture could therefore improve development for the PPU noticeably. An equally effective but maybe more complete approach to this is extending the back-end of an existing compiler that also offers different tools for development and multiple front-ends. As this is the case for GCC this thesis dealt with the task of extending an existing back-end that already supported the core architecture of the PPU's nux architecture and implementing support for the custom s2pp vector extension.

This goal could be achieved to some degree as the back-end does support nux' vector

extension but however is still in a testing stage that hopefully will transition into the standard cross-compiler for the PPU in the near future.

For once the back-end needs testing as only this can show if it is reliable and may reveal bugs that were not obvious before. There even exist first test cases that aim for high-level software tests and already test single intrinsics and should be extended to more complex testing scenarios that involve various combinations of intrinsics with different arguments and dependencies as well as conditional branching and looping which tested on the PPU as well as in simulation. This may also find from eventual bugs or unexpected behavior in the architecture itself or in the compiler and overall improve reliability.

The most recent release of the GCC 4.9 release series dates only back to 2016 and though maintenance is officially discontinued? there exist newer bugs which were fixed through patches thanks to the active community behind GCC. Therefore it is possible—although unlikely—that internal compiler errors are caused by the original compiler instead of the back-end. It should be therefore considered if moving to a newer release—which is 4.9.4—made sense to reduce compiler liability to a minimum and focus solely on the back-end. There is also a number of tests for the AltiVec back-end extension that should be taken in consideration when implementing new nux tests.

Besides high-level testing we can also utilize the newly extended back-end for low-level testing of the architecture itself that is not pure assembly coding but utilizes the asm support of the back-end. Through this we were able to find undocumented undocumented behavior of the PPU when using conditional instructions which seemed like a bug in the back-end at first...

Since there exists software simulation for nux but only a small number of low-level test cases, the development of more tests should be focused as this leads to a well documented nux architecture which is necessary for reliable high-level usage. This should involve testing random combinations of instructions and operands which could be realized by a random code generator that works hand in hand with a simulation platform. The planned nux simulation on a FPGA would accelerate this development and also encourage continuous testing.

Creating such a comprehensive testing environment would allow to constantly test the reliability of generated code and different optimizations and thereby help establishing GCC with an extended back-end as the preferred compiler for nux.

Another feature that might help push the general usage of a patched cross-compiler is the availability of gdb for nux. As debugging PPU software right now is not possible with gdb it might be possible in the future but it is likely that more work on the back-end is necessary for this. Right now there have been no tests of gdb with the new back-end and other features such as optimization and testing are more important at the moment.

At most, only time will tell if the extended-GCC build will be used regularly and become the standard tool for nux development.

This depends on various factors besides what we already mention. GCC support for the POWER architecture, reimplementation of the GCC back-end and the future development of the nux architecture are also important. As the POWER architecture is well established and first systems with POWER9 under way we can safely assume that the POWER architecture will be supported by GCC for years to come. Also the radical reorientation of GCC by the GCC steering committee might seem unlikely we can still rely on existing older releases as newer versions of GCC rarely offer striking new features and older versions continue to receive patches although there are no longer officially maintained.

There exists an experimental build of GCC 7 with an early version of the s2pp back-end and also the latest binutils version by David Stöckel but this build has not been tested yet and only demonstrates the possibility of porting the back-end patch to different GCC releases if it ever became necessary.

Hence the most crucial development is that of the nux architecture. If it is ever decided that the nux is to be completely redesigned this would likely render the current backend pointless and one must start from scratch when creating compiler support for that architecture. Smaller changes however, i.e. adding a set of logical vector operations, may be supported by adding these to the machine description and creating intrinsics from this as described in ?? and (?).

The back-ends structure would also allow for adding custom intrinsics that can be composed from existing machine instructions through the machine description and RTL code.

Eventually the extended GCC back-end may be usable for many years and even longer if the back-end is maintained.

Besides getting future compiler support there are many features yet to come for the PPU: Adding new vector instructions to the instruction set as well as allowing for direct access to the memory of the FPGA that is connected are only two features that could help creating new experiments that were not possible by now and that may also profit from established compiler support. Also should gdb support help creating those applications on HICANN-DLS. Having established the prerequisites for code generation for the PPU we may also expect development in this particular area which further facilitates development

Over all the PPU should get ever more capable over time and take care of different tasks like calibration, creating simple virtual environments, configuration of the system and also supervision of experiments.

of new experiments by many users.

Ultimately PPUs might be running independently on multiple systems, realizing experiments with large simulated networks or running standalone experiments for users in parallel. Still, the future of the PPU is welded by users, developers and the applications they for this innovative system, and giving them a functional compiler for this system is the tool that might come in handy at times.

# **Appendix**

## **Acronyms**

ADC Analogue Digital Converter

**ALU** Arithmetic Logic Unit

**ASIP** Application Specific Instruction set Processor

asm Assembly

**CISC** Complex Instruction Set Computer

CPU Central Processing Unit

**CR** Conditional Register

**DLS** Digital Learning System

**DRAM** Dynamic RAM

insn instruction

IR Intermediate Representation

FPGA Field Programmable Gate Array

FPR Floating Point Register

**FPU** Ploating Point Unit

**GCC** GNU Compiler Collection

**GPP** General Purpose Processor

**GPR** General Purpose Register

**HICANN** High Input Count Neural Network

**ISA** Instruction Set Architecture

**LLVM** Low Level Virtual Machine

LR Linker Register

**LRA** Local Register Allocator

MMU Memory Management Unit

**MSB** Most Significant Bit

nux alternative name for PPU

**POWER** Performance Optimization With Enhanced RISC

PPU Plasiticty Processing Unit

RAM Random Access Memory

**RF** Register File

RTL Register Transfer Language

**RISC** Reduced Instruction Set Computer

**rs6000** RISC system/6000

s2pp synaptic plasiticity processor a.k.a PPU

SIMD Single Input Multiple Data

**SPR** Special Purpose Register

**SRAM** Static RAM

**VE** Vector Extension

**VR** Vector Register

VRF Vector Register File

VMX Vector Media eXtension

## List of nux Intrinsics

## 5 Discussion and Outlook

| intrinsic name     | use                   |                                                                                   | data                                                                                       | a types                                                                                                                                         |                                                                                                          | effect                                                                                             |
|--------------------|-----------------------|-----------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|
|                    |                       | d                                                                                 | a                                                                                          | b                                                                                                                                               | С                                                                                                        |                                                                                                    |
| fxv_add<br>vec_add | d = fxv_add(a,b)      | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | add a and b modulo element-<br>wise and write the result in d                                      |
| fxv_sub<br>vec_sub | d = fxv_sub(a,b)      | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | subtract b from a module<br>element-wise and write the re-<br>sult in d                            |
| fxv_mul<br>vec_mul | d = fxv_mul(a,b)      | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | multiply a and b module<br>element-wise and write the re-<br>sult in d                             |
| fxv_addfs          | d = fxv_addfs(a,b)    | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | add a and b saturationa<br>element-wise and write the re-<br>sult in d                             |
| fxv_subfs          | d = fxv_subfs(a,b)    | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | subtract b from a saturationa<br>element-wise and write the re-<br>sult in d                       |
| fxv_mulfs          | d = fxv_mulfs(a,b)    | same as a                                                                         | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | same as a                                                                                                                                       |                                                                                                          | multiply a and b saturational<br>element-wise and write the re-<br>sult in d                       |
| fxv_stax<br>vec_st | fxv_stax(a,b,c)       |                                                                                   | vector signed char vector unsigned char vector signed short                                | int                                                                                                                                             | vector signed char* signed char* vector unsigned char* unsigned char* vector signed short* signed short* | $\begin{array}{c} a \text{ is stored to memory address} \\ c + b \end{array}$                      |
|                    |                       |                                                                                   | vector unsigned short                                                                      |                                                                                                                                                 | vector unsigned short* unsigned short  vector signed char* signed char* vector unsigned char*            |                                                                                                    |
| fxv_outx           | fxv_outx(a,b,c)       |                                                                                   | vector unsigned char  vector signed short  vector unsigned short                           | int                                                                                                                                             | unsigned char*  vector signed short*  signed short*  vector unsigned short*                              | a is stored to synaptic address c + b                                                              |
| fxv_lax<br>vec_ld  | d = fxv_lax(a,b)      | vector signed char vector unsigned char vector signed short vector unsigned short | int                                                                                        | vector signed char* signed char* vector unsigned char* unsigned char* vector signed short* signed short* vector unsigned short* unsigned short* | unsigned short                                                                                           | $\begin{array}{c} d \text{ is read from memory address} \\ a + b \end{array}$                      |
| fxv_inx            | d = fxv_inx(a,b)      | vector signed char vector unsigned char vector signed short                       | int                                                                                        | vector signed char* signed char* vector unsigned char* unsigned char* vector signed short* signed short* vector unsigned short*                 |                                                                                                          | d is read from synaptic address a + b                                                              |
| fxv_sel            | d = fxv_sel(a,b,c)    | vector unsigned short                                                             | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | unsigned short                                                                                                                                  | 2-bit int                                                                                                | select element from a if c ap<br>plies for that index otherwise<br>select b, store the result in d |
| vec_extract        | d = vec_exctract(a,b) | signed char<br>unsigned char<br>signed short<br>unsigned short                    | vector signed char<br>vector unsigned char<br>vector signed short<br>vector unsigned short | int                                                                                                                                             |                                                                                                          | d is read from synaptic address a + b                                                              |

vecinsert
vecpromote
vecsh fxvsh
fxvpckl
fxvpckl
fxvupckl
fxvupckr
vecsplats l5 fxvsplath
fxvsplath vecsplats
fxvmtac
fxvmtac
fxvmtac
fxvmtac
fxvmtac
fxvmddactacm
fxvaddactacf
fxvaddactacf
fxvaddacts
fxvman
fxvmas
fxvmas
fxvmats
fxvmats
fxvmats
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatacm
fxvmatach
fxvmatach
fxvmatach
fxvmatach
fxvmatach
fxvmatach

# Notes

| paper on future of neuromorphic computing                                                   |  |  |  |  |  |
|---------------------------------------------------------------------------------------------|--|--|--|--|--|
| other systems                                                                               |  |  |  |  |  |
| bild synaptic array                                                                         |  |  |  |  |  |
| figure synapse                                                                              |  |  |  |  |  |
| figure vector extension                                                                     |  |  |  |  |  |
| code example                                                                                |  |  |  |  |  |
| reference                                                                                   |  |  |  |  |  |
| maybe different                                                                             |  |  |  |  |  |
| figure of vector size                                                                       |  |  |  |  |  |
| check with ahartel on this                                                                  |  |  |  |  |  |
| applications of the PPU today like in-the-loop experiments and controlling $$ . $$          |  |  |  |  |  |
| add reference                                                                               |  |  |  |  |  |
| cite friedmann dissertation                                                                 |  |  |  |  |  |
| figure classic architecture                                                                 |  |  |  |  |  |
| complexity $>$ lower clock freq                                                             |  |  |  |  |  |
| inhaltlicher zusammenhang                                                                   |  |  |  |  |  |
| book                                                                                        |  |  |  |  |  |
| graphic of opcode                                                                           |  |  |  |  |  |
| reference                                                                                   |  |  |  |  |  |
| reference                                                                                   |  |  |  |  |  |
| caption and reference to PPC book and asm website correct the table 19                      |  |  |  |  |  |
| all these tables in the appendix                                                            |  |  |  |  |  |
| reference LLVM vs. GCC on ARM, LLVM vs. Gcc in EISC                                         |  |  |  |  |  |
| examples of RTL code                                                                        |  |  |  |  |  |
| $\mathrm{MMU}$ ?= memory controller PPC must handle syncing in compiler when $\mathrm{I/O}$ |  |  |  |  |  |
| is added explain stack and frame pointer PPU instruction set                                |  |  |  |  |  |
| do more often like this: one example for repetitive use                                     |  |  |  |  |  |
| necessary?                                                                                  |  |  |  |  |  |
| check this reference                                                                        |  |  |  |  |  |
| explain this later                                                                          |  |  |  |  |  |
| previous section                                                                            |  |  |  |  |  |
| earlier                                                                                     |  |  |  |  |  |
| fxvstax and fxvlax                                                                          |  |  |  |  |  |
| ? target pragmas compiler anweisung stack boundary -> put this together frame               |  |  |  |  |  |
| offsetr stackoffset fix pck instructions                                                    |  |  |  |  |  |
| !listing!                                                                                   |  |  |  |  |  |
| write an analysis of this                                                                   |  |  |  |  |  |

| 5 | Discussion | and | Outlook |
|---|------------|-----|---------|
|   |            |     |         |

# **Bibliography**

Altivec manual, chapter.

Gcc wiki, accessed 2017.03.26, 2013.

- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., https://gcc.gnu.org/onlinedocs/gccint/index.html, 2017a.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 13.9 RTL Expressions for Arithmetic, 2017b.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 6.45.2 Extended Asm Assembler Instructions with C Expression Operands, 2017c.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.19 Instruction Attributes, 2017d.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.8 Operand Constraints, 2017e.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.15 Defining RTL Sequences for Code Generation, 2017f.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.2 Everything about Instruction Patterns, 2017g.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.16 Defining How to Split Instructions, 2017h.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 17.9.11 Function Entry and Exit, 2017i.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.23 Iterators, 2017j.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 13.6 Machine Modes, 2017k.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 8 Option specification files, 2017l.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.7 Predicates, 2017m.

- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 13.8 Registers and Memory, 2017n.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 13.15 Side Effect Expressions, 2017o.
- GNU Compiler Collection Internals Manual, Free Software Foundation Inc., 16.9 Standard Pattern Names For Generation, 2017p.
- Friedmann, S., nux Manual, 2016.
- Friedmann, S., J. Schemmel, A. Grübl, A. Hartel, M. Hock, and K. Meier, Demonstrating hybrid learning in a flexible neuromorphic hardware system, 2016.
- Heimbrecht, A., Internship report altivec intrinsics, 2017.
- von Hagen, W., *The Definitive Guide to GCC*, section Introduction, pp. xxiii–xxix, second ed., Apress, 2006.

# Statement of Originality (Erklärung):

| I certify that this thesis, and the research to which it refers, are the product of my own work. Any ideas or quotations from the work of other people, published or otherwise, are fully acknowledged in accordance with the standard referencing practices of the discipline. |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Ich versichere, dass ich diese Arbeit selbständig verfasst und keine anderen als die angegebenen Quellen und Hilfsmittel benutzt habe.                                                                                                                                          |
| Heidelberg, February 28, 2017(signature)                                                                                                                                                                                                                                        |