

# Simulation of a MIPS machine

Jan Mezník PZJ895



#### Abstract

TODO

| C | ontents                                                                                                                                                                                                       | 7                                         | TLB                                                                          |                      |
|---|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|------------------------------------------------------------------------------|----------------------|
| 1 | Introduction 1.1 Motivation                                                                                                                                                                                   | 1 8<br>1<br>1                             | MMU 8.1 Privilege levels 8.2 Translation Lookaside Buffer 8.3 Implementation | 15<br>16<br>16<br>16 |
| 2 | CPU Architectures 2.1 Instruction Set Architectures 2.2 MIPS Architecture                                                                                                                                     | 1<br>2<br>9                               | User and Kernel mode                                                         | 16                   |
| 3 | MIPS Core Processing Unit  3.1 Registers                                                                                                                                                                      | 3 11<br>3 12<br>3 13                      | Tests Performance Conclusion opendices                                       | 16<br>16<br>16<br>19 |
| 4 | 3.7 Memory                                                                                                                                                                                                    | 6<br>7<br>7<br>8<br>8<br>8<br>9<br>9<br>9 |                                                                              |                      |
| 5 | Exceptions and Interrupts 10  5.1 Precise Exceptions 10  5.2 Handling Exceptions 10  5.3 Interrupts 11  5.3.1 Polled Interrupts  and Vectored Interrupts 11  5.3.2 Exception procedure . 12  5.4.1 Interrupts |                                           |                                                                              |                      |
| 6 | Memory Mapped IO 6.1 Direct Memory Access 6.2 Detecting and mapping peripherals 6.3 Device descriptors 6.4 Implementation                                                                                     | 13<br>14<br>14<br>15<br>15                |                                                                              |                      |

# 1 Introduction

This report describes the development of a MIPS simulator, intended to support the operating system KUDOS. The simulator will be written in C, and will support the most important processor required to run KUDOS, such as the translation lookaside buffer (TLB), memory management unit (MMU), user and kernel CPU modes, multiple cores (SMP), and I/O device emulation.

#### 1.1 Motivation

KUDOS is a small operating system skeleton intended to be used by students attending operating system project courses at university of Copenhagen. It is used to explore operating system concepts by extending and improving on existing system. Initially, KUDOS targets the MIPS32 architecture, which leverages on the advantages of a reduced instruction set computing - RISC.

To ease the development and debugging of KUDOS, it is desireable to run the OS in a simulated machine. This enables the students and other developers to better inspect the state of the machine while executing, as well as making up for the difference in the hardware of the host machine.

#### 1.2 A simulator

Simulation involves modelling the state of a given target. The purpose of a simulator is to be able to look inside the simulation and inspect the properties and behaviours, that would otherwise only be seen in the real target. A possible by-product of a simulator is that the simulation model will emulate the target and its behaviour, practically immitating the target.

An emulator, on the other side, is intended to only mimick the behaviour of a target on the outside, not correctly reflecting the internal state of the target. Emulators are often used as a substitute for the real targets. The difference between emulation and simulation is therefore, that unlike simulating a system, an emulator only imitates the outward behaviour of its target, and it is hard to predict how the target would act internally.

For example, Android developers often use an emulator to test their applications. Instead of having thousands of real smartphones, they can simply emulate<sup>1</sup> the devices on their computers, saving both time and money.

A common use of a simulator is in the aircraft business, where flight-simulators are being used for both pilot training, engineering, design, and many other purposes. While it gives a very good insight on how a simulated airplane might react in given situations, it does not actually move the users from one point to another.

In our case, we will be modelling the state of a MIPS32 machine, which will be running KUDOS - that is, we will be simulating a MIPS32 machine. Given that our simulator model is implemented well enough, we will be able to run KUDOS, just as on an emulator.

#### 2 CPU Architectures

At the heart of every computer lies the Central Processing Unit (CPU), which is an electronic circuit that carries out the basic arithmethic- and logic calculation as well as process and redirect input and output to other devices in the computer, using the shared busses.

Most modern CPUs are contained on a very small, yet packed intergrated circuit chip, which can also house memory caches, multiple cores, and other processing units.

The functionality of all processors is fundamentally the same. The processor executes some primitive operation by fetching an instruction in the form of binary signals, act upon the instruction and store the result in either one of the its registers, or in the main memory.

A single instruction does very little, but

<sup>&</sup>lt;sup>1</sup>To be precise, they are emulating the hardware devices, but the applications they are testing, are being simulated.

a collection of instructions make up a program. In the very early computing days, computers were programmed in an assembly language, which is simply humanreadable instruction code. As the computers grew more powerful, more complex and much faster, larger programs could be executed. Because it is hard and timeconsuming to write programs using only the assembly language, compilers are used to remove this complexity. A compiler takes a high-level language, such as C, C++ or Java, and creates the corresponding assembly language program for the specific architecture, containing the instructions. This assembly language file is in turn assembled or translated to binary, that the particular CPU can understand.

Besides hiding the complexity of the underlying architecture away from the programmer, it can usually also compile programs to multiple architectures as well as optimising the code to run faster.

#### 2.1 Instruction Set Architectures

The instructions supported by a particular processor is determined by the Instruction Set Architecture (ISA), which is the specification of how the CPU works. An ISA determines the instructions supported, the registers available, memory architecture, addressing modes as well as handling of interrupts.

There exists many different types of ISAs, with both their advantages and disadvantages. For example, some architectures have a very few instructions and registers, which is very practical for small embedded devices, whereas large servers might make use of a large array of registers for complex computations.

Besides the current use of an architecture, designers must also take into account its future uses and applications. As the world of computation is ever growing and evolving at exponential rates, the architectures must be up to the challenge of future computing. Introducing a completely new architecture to the market is very troublesome, and causes a list of problems. One of the main issues is that old software writ-

ten for older architectures will no longer work, and it requires to be either recompiled, rewritten, or even emulated. One such example is the Intel Itanium (IA-64) architecture, which had a very bad marked reception due to its lack of backwards compatibility with the x86 architecture. The emulation of the architecture on IA-64 yielded suboptimal performance and ultimately lost to the AMD x86-64, which in turn was compatible. [6]

Indeed, there are a lot of factors to take into account when designing a new architecture, and every decision has big implications on the future of the whole ISA.

#### 2.2 MIPS Architecture

The MIPS architecture (acronym for Microprocessor without Interlocked Pipeline Stages) was first created in the early 1980s. [19] MIPS is a reduced instruction set architecture (RISC), developed by MIPS technologies, to bring new levels of performance and efficiency into the world of processing units. As an RISC architecture, MIPS aims to implement only the most essential instructions, so that they in return can get highly optimised. This is the based on the RISC philosophy, that by implementing only the most common instructions, the architects can simplify the design and speed up the crucial parts of the instructions. This enables the processor to execute programs faster, but also removes a lot of complexity of implementing large programs.

In contrast to RISC, complex instruction set architecture (CISC) aims to reduce the number of instructions needed to execute a program by implementing instructions packed with functionality. This means that a single instruction in CISC can execute several operations at once, such as loading from memory, arithmetic and storing. While complex programs indeed execute faster on a CISC architecture, the burden of implementing efficient and maintainable code and compilers can outweigh its advantages. [17]

Besides the inspiration from RISC, MIPS has added its own design principles, which are honored and used to question every change, implementation, or design. These are [16]:

- Design Principle 1: Simplicity favors regularity.
- Design Principle 2: Smaller is faster.
- Design Principle 3: Good design demands good compromises.
- Design Principle 4: Make common case fast.

These decisions withstood the trial by fire and proved, that honoring these principles yields good design, easing implementation as well as simplifying hardware.

# 3 MIPS Core Processing Unit

MIPS CPUs are pipelined, meaning that it implements a pipeline which enables it to execute different stages of multiple instructions at once. This gives the processor a higher throughput that would otherwise be possible at a given clock-rate. The processor has 31 general purpose (GP) registers, with additional registers per coprocessing unit. Even the first models of the MIPS CPUs, such as the MIPS R2000, had memory caches and a translation lookaside-buffer, which improves the speed of the processor by reducing the number of main memory lookups.

# 3.1 Registers

MIPS contains multiple types of registers. The most common and most used registers are the general-purpose registers (GP), which can be used for practically anything by the programmer. Special registers are registers implemented for cases where GP registers were either too small or otherwise unsuitable for the purpose.

For additional functionality, the MIPS coprocessor 0 also has its own set of registers that, along with an operating system, bring many features to the system.

#### 3.1.1 General Purpose Registers

In MIPS, there are 32 general-purpose registers, all 32 bit wide. Although they can all theoretically be used however the programmer or assembler wants<sup>2</sup>, there are some conventions for the use of the registers.

| $\mathbf{Mnemonic}^3$ | #     | Use                |
|-----------------------|-------|--------------------|
| \$zero                | 0     | Constant Value 0   |
| \$at                  | 1     | Reserved Temporary |
| \$v0-\$v1             | 2-3   | Function Results   |
| \$a0-\$a3             | 4-7   | Function Arguments |
| \$t0-\$t7             | 8-15  | Temporaries        |
| \$s0-\$s7             | 16-23 | Saved Temporaries  |
| \$t8-\$t9             | 24-25 | Temporaries        |
| \$k0-\$k1             | 26-27 | Reserved for OS    |
| \$gp                  | 28    | Global Pointer     |
| \$sp                  | 29    | Stack Pointer      |
| \$fp                  | 30    | Frame Pointer      |
| \$ra                  | 31    | Return Address     |

# 3.1.2 Special Registers

The special registers in MIPS cannot directly be accessed from the program. Rather, they are modified by dedicated instructions.

| Name | # | Use                    |
|------|---|------------------------|
| \$HI | - | Hi-word of 64bit value |
| \$LO | - | Lo-word of 64bit value |
| \$PC | - | Program Counter        |

HI and LO registers are used to contain the result of a multiplication or division, which, using 2 32bit registers, can end with a 64bit result.

The PC register is pretty self-explanatory, as it simply points the current location in the program (or "counts" the instructions). On other architectures, this register is better known as the Instruction Pointer (IP).

#### 3.1.3 Co-processor 0 registers

Registers in co-processor 0 are mainly used by the system, to provide additional features. The co-processor can have 32 regis-

 $<sup>^{2}</sup>$ Except the 0'th (\$0) register, which can only hold the value 0.

 $<sup>^3\</sup>mathrm{Textual}$  representation used in the assembly language

ters, but only few of them are used consistently across the different CPU implementations. That is due to the wide applications of the MIPS architecture, where manufacturers only define and implement the truly necessary co-processor registers.

| Name    | # | Use                           |
|---------|---|-------------------------------|
| index   | - | TLB entry index               |
| random  | - | TLB random access register    |
| entrylo | - | Low order current TLB entry   |
| context | - | Page-Table lookup addr.       |
| vaddr   | - | Virtual address of exceptions |
| entryhi | - | High order current TLB entry  |
| status  | - | Processor status              |
| cause   | - | Exception cause               |
| ерс     | - | PC when exception occurred    |

The status register is a bit-field of flags used to signal the current state of the processor. It is similar to the EFLAGS register on x86 architectures.

The rest of the registers will be discussed in depth in the SMP chapter.

#### 3.2 Instructions

Each instruction in MIPS is 32-bit long, aligned to word. This simplifies the instruction fetching, decoding, as well as disassembly of the program, for both the processor as well as the programmer.

In MIPS, the instructions have 3 basic formats:



Figure 1: The 3 MIPS instruction formats.

It is clear that, as they have common fields, mainly the opcode field, they are easily distinguishable.

The R-format instructions are mainly used when all the data being processed is located in the registers. That includes

adding between registers, binary operations on values in registers as well as jumping to an address located in a register.

The I-format instructions can operate on both data from registers and immediate values encoded directly in the instruction (thus the 16-bit immediate field). I-format instruction share a lot of common operations with the R-format, where one of the operands is the immediate.

J-format instructions are used solely for jumping instructions, thus the large address field. As it only has 26 bits to address an 32 byte memory location, it shifts the whole value twice to the left, as to align the value in words. The upper 4 bits are retrieved from PC. In practise, this is enough to jump to any address in the program.

# 3.3 Arithmetic Logic Unit

Without any extension co-processing unit, the Arithmentic Logic Unit (ALU) in MIPS32 only supports operations on integers.

The ALU supports basic mathematical operations such as adding (add), subtracting (sub), as well as logical shifting to both left and right (sll, srl), which also can be used to division or multiplication by even numbers.

All bitwise logical operators and, or, and nor are implemented Using these, which additional missing logical operations can be created, such as nand and not.

Since both the operand and destination registers are 32 bit wide, an overflow in the result might occur - that is, the result is larger than what a 32 bit register can hold. In that situation, an exception is raised in the processor, and code to recover from this error is run [16]. This will be explained in later sections.

#### 3.4 Load and Store

MIPS is a "load/store" architecture, where memory is only accessed by specific load and store instructions [5]. This design is a very common for RISC architectures, as it greatly simplifies the pipeline stages and clock timings. In contrast, CISC architectures have many instructions that can do operations on both memory and registers at the same time. For example, on the x86\_64 architecture, the MOVSW instruction reads from a memory location pointed to by register SI, stores it in memory location DI, and at last, increments (or decrements<sup>4</sup>) both registers [7]. This adds additional stall and hazard logic to the processor, and makes it is hard for the CPU to determine how many clock-ticks the instruction will take.

MIPS32 uses 1w for loading a word from the main memory into the register, and a sw, which stores the value from register into the specified memory location. In reality, MIPS32 also has LH, LB and their store counterparts SH SB, which operate on half-word and byte sized loads and stores. However, for performance reasons, the main memory always reads a word (4 bytes), and so, the desired size is computed in the CPU.

### 3.5 Jumping and Branching

To be turing complete, the processor needs to be able to do conditional jumps to other memory locations. This is done with the jumping instructions: jump (j), jumpregister (jr) and jump-and-link (jal). The conditional jumps are: branch-equal (beq) and branch-not-equal (bne).

On the bare-metal level of the processor, these instructions simply modify the value of the Program Counter register, which is otherwise inaccessible from assembly.

#### 3.6 Interrupts

Interrupts are a special way to control what the CPU. It actively "interrupts" the CPU from its current job, and makes it execute a special procedure, specified by an interrupt number. There are usually 3 types of interrupts [14]:

### • Exceptions

Exceptions occur in software, usually when an error has occurred that needs attention from the kernel. This is usually caused by reading from illegal memory addresses or when arithmetic overflow occurs.

#### • Hardware Interrupt

Hardware interrupts are initiated from hardware devices, such as a mouse or a keyboard. When a user presses a key or moves the mouse, the hardware devices send a signal to the CPU that something has happened that needs attention from the kernel.

### • System Call (syscall)

Syscalls are usually used by programs, when they need attention from the kernel. An operating system and the underlying kernel will usually expose an interface with a whole set of functions, that the program can access by syscalls. This can be everything from reporting termination of a program to writing data to the disk.

The action that the CPU has to perform is determined by an interrupt vector table. For each interrupt vector, there is specific code to be executed. Because the interrupt vector table is limited in size, operating systems, such as Linux, use a single interrupt vector number 0x80. Additional arguments for further determination of the service are passed in service number, which is stored in the general purpose registers, and if needed, on the stack.

System call handling is made somewhat easier in MIPS. Whereas in x86-64, you have to set the appropriate system-calls arguments and then do an interrupt on the correct vector, MIPS has a dedicated system-call instruction syscall. The operating system can choose however the arguments are passed, but usually, the service number is stored in \$v0, and the arguments in \$a0-\$a3 [16].

<sup>&</sup>lt;sup>4</sup>This is determined by the direction flag, which determines whether the CPU reads memory from top to bottom or in reverse.

# 3.7 Memory

Memory is an essential part of a working computer. The memory stores both the program code (the instructions) as well as the program variables, which cannot fit in all the CPU registers.

The memory is a large chuck of consecutive bytes. While these can store practically anything, operating systems designate memory areas into multiple segments (also called sections), which eases memory management. While many operating systems create many different segments in the memory map of a program, these segments usually fall into 4 categories [4] [16] [21]:

#### • .text

This section consists of the executable contents of a program, which is the instructions. In modern processors and OS, this section cannot be written to.

#### • .data

This segment contains the initialized variables used in the program. This section can be both written and read from, but is not executable.

#### • .bss

Uninitialized memory segment. This allocates memory for variables, which are not necessarily initialized by the OS, and can contain random values.

#### • stack

The stack usually starts at the top of the user memory area. It is a Last-In First-Out (LIFO) structure, meaning that elements added last are the first ones being removed, hence the name. The top of the stack is usually pointed to by a designated register in the CPU, in MIPS32, that is the \$\$P\$ register. The stack is used for local variables within functions, which usually have a very short lifespan.

#### • heap

The heap starts immediately after the initialized segments (.text and .data), and is used for variables with a longer lifespan. This memory area grows from bottom up, that is, starting at a lower address, moving to higher addresses. This memory area is maintained by the program, and can store practically any variables and other data-structures. This area can also be extended by the operating system, which makes it ideal area for storing large amounts of data.

While the memory layout of a process is completely implementation specific on the underlying operating-system, on most 32bit UNIX systems, the memory area is very similar to what can be seen on figure 2.



Figure 2: Process memory map.

Although the address space is 32 bit, the top half is not available for the user, and it is usually used by the operating system [23].

# 4 Pipeline

CPU speeds are usually measured by timing the execution time of programs. Since a computer program is just a collection instructions, the speed of the CPU is determined by how fast it can process each instruction. Every CPU has a clock, which ticks at a given rate. For every tick, a new instruction is executed. This clock ensures that all instructions "flow" through the processor without problems, and that the electrical components, such as the ALU

or the control-unit, can manage to carry out their tasks in that time. Naturally, electrical engineers have pushed the limits of the circuits to manage the highest clock rate. The clock rate of the very first processors was measured in hertz and kiloHertz (kHz), but most modern desktop CPUs reach in multiple GigaHertz (GHz) [26]. However, even with those speeds, the demand for faster processing units is ever-growing, and other techniques to speed-up the execution are used.

One of those techniques is pipelining, which separates the circuit into multiple stages, much like the assembly lines in factories. In such factories, workers have their own station at the assembly line, do a specific task repeatedly, and forward it down the line. This greatly increases the throughput of a factory and decreases the labor need. In MIPS, this idea is implemented by separating the processor into 5 stages [16]:

- Instruction Fetch (IF)
  Fetches the next instruction.
- Instruction Decode (ID)
  Reads the instruction, sets the appropriate control flags, reads the relevant registers and sends the data to the next stage.
- Execute

Executes the instruction. This is typically done by the ALU with the appropriate operation supplied.

#### • Memory Access

All operations on memory happen here. This stage either loads a memory address or stores a value at an appropriate address.

Write Back
 Writes the results to the CPU registers.

Each of these stages will naturally use less time than all of them combined, and since the clock is shared in all stages, it is set to the slowest stage in the pipeline.

Not only do we have faster tick rate on our clock, but we are also able to perform multiple operations concurrently. Figures 3 and 4 show the timing of each instruction, and how pipelining might improve the whole process.



Figure 3: Single cycle implementation.



Figure 4: Pipelined approach.

# 4.1 Design of the MIPS32 Pipeline

The advantages of a pipelined design does not come without a price. Although the single-cycle implementation of the processor is very similar to the pipelined approach, it has its fair sets of challenges. The main problem with executing instructions concurrently is that the instructions will often rely on the result of the previous instructions. These situations are referred to as hazards.

#### 4.1.1 Data Hazard

Data hazards mainly occur when an instruction cannot continue, because it must wait for the result from an earlier instruction. Suppose a program wants to calculate the sum of 4 integers:

$$A = A + B + C + D$$

In MIPS32 assembly, this would be written as:

Figure 5: Code exposing data hazard situation.

Here, the first two instruction will have no trouble executing, as they do not share any source or destination registers. third instruction however, will not be able to fetch the updated values. When it is in the ID stage, where it decodes the register s0 and s1 values, the previous instructions are still in the pipeline, in the EX and MEM stage! These instructions have not written back their results in the appropriate registers, and so, instruction 3 cannot fetch the correct value of s0 and s1, unless it waits 3 clock cycles. situation can be visualized in a sequential graph show in figure 5, where it can be seen that the result of the first two instructions is needed in the third.



Figure 6: Time-sheet of the program listed in figure 5.

The problem is very clear - the data is needed before it is saved at the appropriate place. However, the data is usually already available in the EX stage in the ALU, and can therefore be used in the same clock-cycle in the ID stage. This is called forwarding or bypassing, and is handled by a dedicated Forwarding Unit. This unit is wired to both EX, MEM and WB stages, and has a logic unit that checks for matching registers in the instruction, in which case it forwards the correct result back in the pipeline.

#### 4.1.2 Control Hazard

Another type of hazard is the Control Hazard, which occurs when the processor must make a decision based on an instruction, while others are executing. Whether a branch is taken or not can be determined by the ALU in the EX stage, which does a equality check on the two values. This does not add much additional logic to the circuit, as the forwarding unit already bypasses all the most recent results as the operands to the ALU. If a branch is taken, a challenge emerges that we already have the next instructions in both ID and IF stage, even though our program is inteded to branch away. There are many possible ways to solve this issue, the most simple, but inefficient one, is to stall the pipeline by inserting two No-Operation instructions after the branch. This way, the instruction after the branch will do nothing, but in both situations, we loose a 2 clock cycles. Although this does not sound like much, considering that branching instruction make up about 25% of any assembly code [10], this can be a lot.

A way to ease the problem with this many wasted clock cycles is to move the branching calculation in the ID stage. A simple comparison unit for the register values is required, but the forwarding unit also needs to be be modified to forward results to the ID stage.

A better, but much more complex technique to solve this issue, is to have a dedicated "Prediction" unit in the processor, which will predict whether a branch will be taken or not, and fetch instructions to the IF stage accordingly.

#### 4.2 Implementation

The implementation of the pipelined simulator will group the devices in the machine together, so that each device or system has its own C structure.

#### 4.2.1 Simulator structures



Figure 7: Relations between the simulator structures.

In the simulator, the processor cores, pipeline registers, and memory unit all belong to a "hardware" structure, that binds them together, making up the state of the simulated machine. The relations between the components can be seen in figure 7. In the simulator, the hardware structure will be used to pass along the simulated hardware to functions, which then can access the underlying devices, modifying them properly.

#### 4.2.2 Pipeline functions

The pipeline is implemented by running each pipeline stage separately. For each clock-tick, the pipeline stage functions are being called in reverse. This can seem very unnatural, but this prevents pipeline registers to be overwritten before they are used. This also prevents an instruction to get throught the pipeline in one tick of the clock, and simplifies the assignment of the pipeline registers.

```
void tick() {
    interpret_wb(...);
    interpret_mem(...);
    interpret_ex(...);
    interpret_id(...);
    interpret_if(...);
}
```

Each of the functions receive the relevant pointer to the CPU core structure, which holds all the relevant information about the state of the core.

#### 4.2.3 Forwarding Unit

The pipeline stages work without regard for data hazards, which is where data is required for an instruction before it is stored by the previous instruction. Therefore, we add a forwarding unit function forwarding\_unit(), which is run by the end of each clock cycle. Being the last function, it has all the recent information in all the pipeline stages, which enables it to decide which values need to be forwarded. The forwarding unit only forwards values to the ALU input mutual exclusion units MUX A and MUX B. It does that by first checking if the previous instructions write to a register, that is, modify a register value. If it indeed is the case, the destination registers are compared, and if the match, the value from the previous instruction is bypassed. This is done for both the A and B input to the ALU in both Memoryand Writeback-stages. The code for forwarding values from stage MEM to EX can be seen in figure 8. Forwarding from the WB stage is almost identical.

```
/* Forward to A MUX */
if(EX_MEM.c_reg_write == 1
    && EX_MEM.reg_dst != 0
    && EX_MEM.reg_dst == ID_EX.rs)
        ID_EX.rs_value = EX_MEM.
        alu_res;

/* Forward to B MUX */
if(EX_MEM.c_reg_write == 1
    && EX_MEM.reg_dst != 0
    && EX_MEM.reg_dst == ID_EX.rt)
    ID_EX.rt_value = EX_MEM.
    alu_res:
```

Figure 8: ALU result being forwarded from the MEM stage to EX.

#### 4.2.4 Hazard detection unit

Control hazards arise when the processor is branching to another execution path based on the result of the previous instructions. As there is no "correct" solution to the branching hazard, the MIPS CPU manufacturers often implement their own branch predictions based on the intended use of the processor. For simplicity in our simulator, we will always predict that the branch is not taken. We will also simulate that the branching is decided in the ID stage<sup>5</sup>, so that only we only have 1 branch delay slot. In that case, we always run the next instruction after the branch. This works well, because this is a widely accepted approach in many RISC architectures [25]. Knowing that the instruction following a branch is always executed, many standard compilers and assemblers restructure the code so that branch-delay slot is not wasted.

For example, where assignment of a value to a register and a branch would consume 3 instructions, utilization of the branch-delay slot can bring it back to 2. The difference can be seen on figure 9. Note that the order of the instruction does not matter in that case.

```
      addi $v0, $0, 10
      beq $0, $0, exit

      beq $0, $0, exit
      addi $v0, $0, 10

      nop
      # ...
```

Figure 9: Wasted branch delay slot (left) and utilized branch delay slot (right).

# 5 Exceptions and Interrupts

As mentioned in section 3.6, exceptions and interrupts are a way to signal the CPU of an event, that needs immediate attention. In MIPS, everything that defers the usual execution path of a program to attend to another event, such as interrupts, traps, syscall, are called exceptions [21]. Exceptions are all handled the same way in MIPS.

# 5.1 Precise Exceptions

Due to the pipelined nature of MIPS (and other RISC architectures), exceptions can occur anywhere in the pipeline. That is, when an exception is raised, there are probably uncommitted instructions<sup>6</sup> in the pipeline. For simplicity of both hardware and software, it is desireable to handle an exception exactly when it occurs. In MIPS, precise exceptions means that all instructions preceding the exception victim<sup>7</sup>, are committed, while all instructions after are not.

Precise exceptions give the programmer the usual, sequential view of the program. However, precise exceptions in MIPS can be very expensive, potentially clearing the whole pipeline of otherwise faultless instructions.

#### 5.2 Handling Exceptions

Depending on the type of the exception, a different exception handler must be used. Many other architectures, such as the x84 and later x86\_64 use vectored interrupts. In vectored interrupts, a table of

<sup>&</sup>lt;sup>5</sup>Technically, we are deciding this in the interp\_ex function, but this will have the same effect due to the reverse execution of our pipeline stages.

<sup>&</sup>lt;sup>6</sup>Instructions, whose result is not yet written back to the registers.

<sup>&</sup>lt;sup>7</sup>The instruction that caused the exception.

addresses are kept in the Interrupt Vector Table (IVT), specifying start-addresses of different interrupt handlers. When an exception is raised, the type of exception is determined by some microcode or other hardware, and an appropriate handler is chosen from the IVT<sup>8</sup> [13] [21].

Although this approach is arguably faster, it is not used efficiently in practice. many modern operating systems, interrupt handlers have a lot of code in To avoid redundant code and large interrupt handlers, the handlers often jump to a shared location in which the redundant code is located, for example, to save the current registers on the stack. First, by jumping to a common location, the purpose of the IVT is (almost) defeated, as the time saved is being spent by jumping to another location anyhow. Secondly, the information about the interrupt retreived from pure hardware or microcode is very limited without building very complex hardware. A large OS would have to analyse the exception further.

Although MIPS32 has a type of an interrupt vector table [21], it is very rarely used due to its flaws mentioned above, and the fact that each interrupt handler is only 32 instructions long - something no real exception handler can fit into [21].

Due to the speeds of modern processors, MIPS32 exception handling routines always start at the same address. handler, when an exception arises. This exception handler is by default located in the unmapped and uncached memory segment KSEG1, address 0x80000080, although this can be changed by modifying the CP0 EBase register the status register bit BEV is set [21].

# 5.3 Interrupts

Hardware caused exceptions are often called "interrupts" in the MIPS world. While these interrupts are handled just as any other type of exception, they are not triggered in the code by the means of a syscall instruction. Since interrupts happen from outside the CPU, a system has to be implemented to notify the processor of the external event.

Like in many other processor architectures, two pins are added to the MIPS CPU, which are used to signal of a hardware interrupt occurance [12].

The first pin is called the Interrupt Request (IRQ), which indicates a pending interrupt. The second pin is the Interrupt Acknowledge (IACK), and is controlled by the processor to indicate that an interrupt request is acknowledged and is being serviced. The IACK pin is needed, since the CPU might be busy servicing another exception, not being able to attend the hardware interrupt immediatelly. Thus, the IRQ pin can be enabled over multiple clock-cycles before it is acknowledged [12]. While it may sound that having a hardware device wait several clock-cycles results in delayed output, it should be noted how fast MIPS CPUs (and many others) are, and how many instructions they are able to carry out before the external device even notices.

It should be noted that this is another point where MIPS32 deviates from its more mainstream x86 counterpart. While the MIPS processor knows that an hardware interrupt has occured, it does not know which device raised it. To detect the device that raised the interrupt, it must scan all the device IO registers, discussed in section 6. This approach is called "Polled Interrupts".

# 5.3.1 Polled Interrupts and Vectored Interrupts

In polled interrupts, the processor does not know the hardware source of the interrupt, and must therefore check the status IO register of each device in order to determine if the device needs to be serviced [12]. This method is used in MIPS, for simplicity of the hardware, and because exceptions are handled by generic exception handler code as well.

Another method of handling interrupts is using vectored interrupts, used by the x86 architecture. In vectored interrupts,

<sup>&</sup>lt;sup>8</sup>A x86 assembly programmer might note that int 0x80 in Linux calls the interrupt handler, located at index 128 in the IVT.

much like vectored exceptions, the processor jumps to a certain interrupt handling code, based on the interrupt type. However, since the event is external to the processor (caused by hardware), additional wires must be attached to the CPU, informing it of the IRQ number, which is unique for each type of device [14]. This requires a lot of added hardware complexity in the form of an interrupt controller, such as the well known Intel 8259 IRQ controller, which can queue up multiple interrupts, even prioritizing them according to the need of the CPU [14].

### 5.3.2 Exception procedure

Before exception handler code is executed, the processor must perform certain actions, required to correctly recover from the exception. These steps are [21]:

- 1. Setup the EPC register, which hold the address of the victim instruction. This is used when returning from the exception, so that the execution can continue from the correct position. EPC is very similar to the general-purpose \$RA, which also holds the return-address for user programs.
- 2. Flip on the EXL bit in the status register. This bit forces the CPU into kernel-mode, giving it higher privileges.
- 3. Set the Cause register, which holds all the information about the exception.
- 4. Jump to 0x80000080, where the exception handler code starts.

When the exception handler code is finished, the eret instruction should be used. Similar to the ret instruction, eret fetches the return address from EPC register, clears the EXL bit in the status register, and jumps to the return address.

# 5.4 Implementation

In the simulator, an enum structure exception\_t has been implemented, which holds all the necessary excetion types

(none-exception included). All functions in the simulator, that can cause an exception, such as reading from the memory, now return an exception to signal the simulator of the event. In the simulator, this exception is stored in the corresponding pipeline-register, forwarded to the next stages. When an instruction with an exception code other than "exception none" reaches the Write-Back stage, the exception handling mechanism is called.

Note that the exceptions are not necessarilly handled in the order they are raised, but rather, the order of the instructions. For example, in the following assembly code, a memory exception might occur in MEM at the same time as the arithmetic overflow exception in EX:

```
lw $t0, 10($t1)
add $t3, $t2, 100
```

However, due to the way exceptions are forwarded in the pipeline, the memory exception reaches WB stage first, and gets handled first.

#### 5.4.1 Interrupts

As mentioned in the interrupt section 5.3, interrupts need another way to signal the processor of an exception that needs attention. In the simulator, for every clock-cycle, we iterate over every external device that can trigger an interrupt. In the loop, we check for its interrupt status, and if needed, handle the exception.

In our simulator, the interrupt request is checked every clock-cycle. However, in some simulators, such as SPIM, a short artificial delay is added to enhance the realism of the simulator [9].

# 5.4.2 Preparing for exception handling

When an instruction raises an exception, handle\_exception(...) is called. This function follows the exception entrance procedure very closely, described in section

5.3.2. First, it clears all previous pipeline stages, as they now are discarded. The WB stage is kept, as we need to extract some information about the state. Then, the cause is determined by analyzing the exception, and stored in the CAUSE register. Privileged mode (kernel mode) is entered and PC is set to the exception-handler address to force the jump. In the end, stage WB is also cleared to ensure no faulty results are written back to. It is to be noted that this function is *not* pipelined in the simulator, giving the impression that all this happens in only one clock cycle.

Given that this procedure is executed regardless the type of the exception, one might argue that this might be implemented in microcode or hardware, really only using 1 clock-cycle.

# 6 Memory Mapped IO

There are multiple ways a computer can communicate with its peripheral devices. The most used techniques are Memory-Mapped I/O (MMIO) and Port-Mapped I/O (PMIO). Port-Mapped I/O is an older method of accessing peripherals, and is much more common in consumer computers with Intel or AMD CPUs [7].

PMIO requires the processor to have additional pins and special instructions, just for communication with I/O devices. I/O devices have a specified port number, which the processor uses to identify and communicate with the device, much like an address. For example, Intel x86 processors might be polling<sup>9</sup> a keyboard using the special in instruction, specifying keyboard ID 0x60 [8] [15]:

```
_wait:
in al, 64H ; Read status
and al, 1 ; Check if ready
jz _wait
in al, 60H ; Read data
```

Besides the added circuitry in the processor, port-mapped I/O are fairly simple to

implement and easy to use. They have their own special instructions, and they do not share memory space, which prevents confusion about memory segmenting. However, PMIO approach is very limited to only in and out instructions, as well as very limited "ID" space for the devices.

Memory-Mapped I/O is commonly used by RISC architectures. In MMIO, communication with the peripherals happens over the same address bus as the memory, so that memory and I/O communication share address space. This means that when a processor, that uses MMIO, wants to access a peripheral, it uses a memory address for communication.

The advantage of memory-mapped I/O is that the method uses already existing memory bus lanes for communication, unlike the port implementation. However, since the memory is shared with peripherals, and these addresses are not maintained by the processor, additional logic has to be created in the memory-management unit. ever, PMIO is very limited to the simple in / out instructions, as well as a hardwarelimited device allocation, whereas MMIO is much more free with device "address" allocation. Due to these reasons, memorymapped I/O is becoming increasingly popular, even on modern x86 hardware [15]. MIPS sees Input/Output (IO) devices as a set of special-purpose registers. These special registers are the processors only way of communicating with a given device.

For every device, there 3 types of registers [1] [20]:

#### • Status registers

Provide information about the underlying device. These registers are readonly for the CPU.

#### • Control registers

Used to communicate and control the device. These registers are writeable for the CPU, but may not always be readable.

#### • Data registers

Used for the actual data-transport. For example, the latest key pressed on

<sup>&</sup>lt;sup>9</sup>Polling refers to the activity where a CPU periodically samples the status of an external device

the keyboard might be stored in this register.

In memory, these registers have allocated 4 bytes each, and are always located sequentially in the memory. Figure 10 shows the actual bit-allocation in the register.



Figure 10: Set of memory mapped I/O registers [20]

When a device is detected, the memorymanagement unit will allocate an address for the device, notifying both the CPU and the device of these newly allocated registers. From now on, it is up to the driver to interface with the devices.

Depending on the type of the IO device, the device can be represented from a few registers to dozens. For example, a mouse or a keyboard only transmits few bytes of information at a time, needing only a few registers, while graphic adaptors or disk drives might need more [1].

These special-purpose registers are located in the RAM, mapped to a certain segment. A device controller maintains a list of these registers, and maps new devices to the memory. [2]

On MIPS32, the highest 64 kilobytes (0xffff0000 - 0xffffffff) in the available memory are used for mapping these special registers [1]. The memory mapping the IO area must not be cached, as it can cause major problems for the caches, and heavy delays due to many cache-misses [21]. However, it is obvious that IO registers can be very inconvenient for larger and fast data-transfer, due to the very limited register size. To dodge this bottlenect, memory controllers are implemented to give IO devices direct access to larger parts the main memory, without the need of going through

the CPU first.

# 6.1 Direct Memory Access

Direct Memory Access controllers, also called DMA controllers, are features that allows certain hardware to access the main memory independently of the CPU. In MIPS, this feature is built-in in the MMU. Larger transfers between hardware and the memory is usually initiated through the device registers, where the CPU sets up the MMU and communicates the target address. Then, the IO device is free to read and write in the target address. When the peripheral is finished with its actions on the memory, it will usually notify the CPU using an interrupt, having the processor taking care of the rest.

Direct Memory Access is often utilized by peripherals with heavy IO use, such as the storage drives, as well as network-, sound-, and graphics-cards.

Because KUDOS does not read large portions of data from its IO devices, nor is there any DMA support in the current YAMS simulator, DMA will not be implemented.

# 6.2 Detecting and mapping peripherals

When a computer is turned on, or a new peripheral device is plugged in, the MIPS processor and its MMU automatically map the device to a given address, allocating the necessary IO registers. All this happens seamlessly and without the interaction of the operating system. This is problematic, because the OS needs to know the type of the device and the address of its memory mapped registers, amongst other things. There is no real way to communicate this between the processor and the operating To make matters worse, each system. processor has its own addresses for the registers, and a slightly different configuration. On top of that, each board has its own set of external components, busses and metadevices [3] [27]. Instead of compiling a very specialized kernel for seemingly every configuration of a System on a Chip (SoC), operating system and bootloader developers have designed a specification devicetree, which is used to describe system hardware [3].

The devicetree specification can describe a lot of aspects of a hardware devices, such as the interrupt number, size of the IO registers, even the device clock speeds.

The Linux kernel also enjoys this devicetree specification, for both ARM and MIPS architecuters [22,linux-4.6/arch/mips/boot/dts], reducing the time and complexity implementing a new board or a SoC.

In KUDOS, this data structure is unfortunatelly not used, because the hardware it is run on is very limited, consisting almost only of simulated devices. As a result of that, KUDOS looks for at a predefined address for data structures describing the devices, called the device descriptors.

# 6.3 Device descriptors

To detect and communicate with external devices, device descriptors are used. A device descriptor is a data structure which, just like a devicetree, specifies all the relevant information about the device, such as its type, name, register addresses and so on. The format of the device descriptor is defined in Yet Another MIPS Simulator, the current simulator used to run KUDOS. The same device descriptor structure is defined in KUDOS, so that KUDOS can interface with the underlying simulator, retrieving the relevant information. The structure is defined in kudos/drivers/mips32/arch.h [11] as:

```
typedef struct {
    uint32_t type;
    uint32_t io_area_base;
    uint32_t io_area_len;
    uint32_t irq;
    char vendor_string[8];

    /* Reserved area (unused). */
    uint32_t resv[2];
} io_descriptor_t;
```

#### TO BE COMPLETED.

# 6.4 Implementation

Implementation of external devices requires a lot of new modules and features in the simulator. The main issue is that we require to communicate with events, not necessarily handled by the simulator.



Figure 11: Memory layout of IO device descriptors and registers.

#### $7 \quad \text{TLB}$

#### 8 MMU

Most processors strive to help the OS and its developers to ensure optimal utilization of the main memory, as well as the overall security of the programs memory usage.

In most systems, it is a common practice to split the memory into multiple parts, where each part can only be accessed by the intended user. For example, a user program should never be able to modify or even read memory of the kernel. If that was possible, a single program would be able to break the running OS by overwriting some essential values, or even retrieve passwords from variables in other programs.

These mappings abstract the program addresses from the physical addresses, and translation has to be implemented, to map these two address spaces. This is usually done by the Memory Management Unit

(MMU).

One of the big differences of the MIPS architecture is that many components in the processor are completelly optional, and up for the manufacturer to decide whether they suit the needs of the desired application, be it embedded device or a supercomputer.

The memory management in MIPS32 is no exception, as MMU is completely optional. However, on MIPS32, to still ensure some basic form of user-restricted memory areas, the memory is split into 4 areas, with each a designated usage: [18] [21]

• *kseg0*, 0x80000000-0x9fffffff, 512MiB

The first kernel segment is which is translated by stripping of the first bit in the address field. This segment is using caching, and can therefore first be used when caches have been initiated. [21]

• kseg1, 0xa0000000-0xbfffffff, 512MiB

Kernel segment which is *not* cached. It is the only segment that is guaranteed to be available immediately after a system reset (or boot), where no other CPU devices are initialized. This is the segment where bootup-code is stored.

- kuseg, 0x00000000-0x7ffffffff, 2GiB
   This is the only segment that can be used by user programs. It is mapped and cached, so the OS needs to initialize caches and possible MMU for this segment to be useable.
- kseg2, 0xc000000-0xffffffff, 1GiB
  This segment is for additional access
  modes, such as the kernel mode. This
  segment is mapped and cached, and
  thus, cannot be used immediately after bootup.

#### 8.1 Privilege levels

When the CPU starts up, it is by default in kernel mode. In this mode, it has the privilege to access all addresses (along with many other things). The kernel will retain this privilege until it starts user programs, in which case, the OS flips the KUc bit in the co-processor0 status register [24].

- 8.2 Translation Lookaside Buffer TODO
- 8.3 Implementation

12

- 9 User and Kernel mode
- 10 SMP
- 11 Tests
- 12 Performance
- 13 Conclusion



Figure 12: Main memory mapping in the simulator.

# References

- [1] Jason W. Bacon. Mips assembly language programming. http://www.cs.uwm.edu/classes/cs315/Bacon/Lecture/HTML/ch14s03.html, 2010-2011. [Online; accessed 09-May-2016, (Archived by WebCite at http://www.webcitation.org/6hPM4fNga)].
- [2] Robert Britton. Mips assembly language programming, 2002. Computer Science Department California State University, Chico Chico, California.
- [3] devicetree.org. Devicetree specification, release 0.1-pre1-20160430, 30 April 2016. Available at: http://webdev.linaro.org/devicetree.org/specifications/ and https://github.com/devicetree-org/devicetree-specification-released.
- [4] Gustavo Duarte. Anatomy of a program in memory. http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/, 2009.
  [Online; accessed 9-April-2016, Archived by WebCite (R) (at http://www.webcitation.org/6gf8eojPM].
- [5] Michael Flynn. Computer architecture : pipelined and parallel processor de-

- sign. Jones and Bartlett, Boston, MA, 1995.
- [6] Johan De Gelas. Itanium is there light at the end of the tunnel?, 2005. [Online; accessed 28-March-2016].
- [7] intel. Intel® 64 and IA-32 ArchitecturesSoftware Developer's Manual. December 2015. Combined Volumes:1, 2A, 2B, 2C, 3A, 3B, 3C and 3D.
- [8] intel. ntel® 7 Series / C216 Chipset Family Platform Controller Hub (PCH). June 2012. Order Number: 326776-003.
- [9] James R. Larus. Appendix: Assemblers, linkers, and the spim simulator, 2014. Appendix to Computer Organization and Design: The Hardware/-Software Interface.
- [10] Dr. Lawlor. Instruction encoding & frequency. https://www.cs.uaf.edu/2012/spring/cs641/lecture/01\_24\_encoding.html, 2012. [Online; accessed 8-April-2016, Archived by WebCite (Rhttp://www.webcitation.org/6gdfU63W0].
- [11] University of Copenhagen. Kudos—københavns universitets didactic operating system. https://github.com/DIKU-EDU/kudos.
- [12] University of Pittsburgh. Exception and interrupt handling in the mips

- architecture. https://people.cs.pitt.edu/~don/coe1502/current/ Unit4a/Unit4a.html. [Online; accessed 09-May-2016, (Archived by WebCite at http://www.webcitation.org/6hPGPDvWy)].
- [13] OSDev.org. Interrupt vector tahttp://wiki.osdev.org/ ble. Interrupt\_Vector\_Table, 2016. [Online; 07-April-2016, accessed (Archived by WayBack chine https://web.archive. org/web/20160427193853/http: //wiki.osdev.org/Interrupt\_ Vector\_Table)].
- [14] OSDev.org. Interrupts. http://wiki.osdev.org/Interrupts,
  2016. [Online; accessed 19-May2016, Archived by WayBack Machine at https://web.archive.
  org/web/20160519163614/http:
  //wiki.osdev.org/Interrupts].
- [15] OSDev.org. I/o ports. http://wiki.osdev.org/I/O\_Ports,
  2016. [Online; accessed 09-May2016, (Archived by WayBack machine at https://web.archive.
  org/web/20160512212121/http:
  //wiki.osdev.org/I/O\_Ports)].
- [16] David Patterson. Computer organization and design: the hardware/software interface. Morgan Kaufmann, Oxford Waltham, MA, USA, 2014.
- [17] David A. Patterson and David R. Ditzel. The case for the reduced instruction set computer. SIGARCH Comput. Archit. News, 8(6):25–33, October 1980.
- [18] Imagination Technologies Group Plc. Mips microprocessors overview. http://cdn.imgtec.com/mips-training/mips-basic-training-course/slides/Memory\_Map.pdf.
  [Online; accessed 18-April-2016, Archived by WebCite (Rat http://www.webcitation.org/6guotdraU)].
- [19] Imagination Technologies Group Plc. Mips microprocessors

- overview. https://imgtec.com/?do-download=4408, 2014. [Online; accessed 29-March-2016].
- [20] Imagination Technologies Group Plc. Mips® architecture for programmers vol. iii: Mips32® / micromips32<sup>TM</sup> privileged resource architecture. https://imgtec.com/?do-download=5145, July 10, 2015. [Document Number: MD00090, Revision 6.02].
- [21] Dominic Sweetman. See MIPS run. Morgan Kaufmann Publishers/Elsevier, San Francisco, Calif, 2007.
- [22] Linus Torvalds. Linux stable kernel, mainline 4.6, 2016-05-15. https://www.kernel.org/.
- [23] Central Connecticut State University.

  Memory layout. http://chortle.
  ccsu.edu/assemblytutorial/
  Chapter-10/ass10\_3.html, 2015.
  [Online; accessed 10-April-2016,
  Archived by WebCite (Rat http://
  www.webcitation.org/6gfKPa3Iu].
- [24] Matt Welsh. Mips r2000/r3000 architecture. http://www.eecs.harvard.edu/~mdw/course/cs161/handouts/mips.html, 2005. [Online; accessed 20-April-2016, Archived by WebCite (Rat http://www.webcitation.org/6guuHNcwJ].
- [25] Wikipedia. Branch delay slots. https://en.wikipedia.org/wiki/Delay\_slot#Branch\_delay\_slots, 2016. [Online; accessed 9-April-2016, Archived by WebCite (Rat http://www.webcitation.org/6geBmObpx].
- [26] Wikipedia. Clock rate. https://en. wikipedia.org/wiki/Clock\_rate, 2016. [Online; accessed 7-April-2016].
- [27] Xillybus.com. A tutorial on the device tree (zynq) part i. http://xillybus.com/tutorials/device-tree-zynq-1. [Online; accessed 20-May-2016, (Archived by WebCite at http://www.webcitation.org/6hdeUTNk1)].

# Appendices