### x86 Initial Boot Sequence

Advanced Operating Systems and Virtualization
Alessandro Pellegrini
A.Y. 2019/2020



## The whole sequence at a glance







### **Boot Sequence**

- **BIOS/UEFI**: the actual hardware startup
- **Bootloader Stage 1**: executes the Stage 2 bootloader (skipped in case of UEFI)
- **Bootloader Stage 2**: loads and starts the kernel
- **Kernel Startup**: the kernel takes control of and initializes the machine (machine-dependent operations)
- **Init**: first process, basic environment initialization (e.g., SystemV Init, systemd)
- **Runlevels/Targets**: initializes the user environment (e.g., single-user mode, multiuser, graphical, ...)





### Hardware Power Sequences: The Pre-Pre-Boot

- When someone pushes the power button, the CPU can't simply jump up and start fetching code from flash memory
- The hardware waits for the power supply to settle to its nominal state
- Additional voltages must be supplied:
  - On x86 systems: 1.5, 3.3, 5, and 12 V
  - Power Sequencing: these must be provided in a particular order





### Hardware Power Sequences: The Pre-Pre-Boot

- The power is sequenced by controlling analog switches, typically field-effect transistors
- The sequence is often driven by a Complex Programmable Logic Device (CPLD)
- Platform clocks are derived from a small number of input clock and oscillator sources.
  - The devices use phase-locked loop circuitry to generate the derived clocks used for the platform.
  - These clocks take time to converge.












### Initial life of the System

- Once supply voltages and clocks are stable, the power-sequencing CPLD de-asserts the reset line to the processor
- At this point, the system is in a very basic state:
  - Caches are disabled
  - The Memory Management Unit (MMU) is disabled
  - The CPU is executing in **Real Mode** (8086-compatible)
  - Only one core can run actual code
  - Nothing is in RAM (what to execute?)





## Segmented Memory



logical address = <seg.id : offset> (e.g. <A : 0x10>)



## Segmentation-based addressing

- There are 4 basic 16-bit segment registers:
  - CS: code segment
  - DS: data segment
  - SS: stack segment
  - ES: extra segment (to be used by the programmer)
- Intel 80386 (1985) added two new registers
  - FS and GS, with no predefined usage





## Segmentation-based addressing

The CPU resolves a logical address by adding the offset to the base address of the segment selected by the segment register.







### Segmentation Nowadays

- Segmentation is still present and always enabled
- Each instruction that touches memory implicitly uses a segment register:
  - a jump instruction uses CS
  - a push instruction uses SS
- Most segment registers can be loaded using a mov instruction
- CS cannot be loaded with a mov: it changes only through a far control transfer, such as a far jmp or a far call





#### x86 Real Mode

- 16-bit instruction execution mode
- 20-bit segmented memory address space
  - 1 MB of total addressable memory
- The segment register holds the upper 16 bits of the 20-bit segment base address (the base is the register value shifted left by 4 bits)
- Each segment can range from 1 byte to 65,536 bytes (16-bit offset)





## Real Mode Addressing Resolution







Addressing in x86 Real Mode (main memory spans 0000:0000 to FFFF:FFFF, with physical addresses growing upward):

| Segment   | Segment address | Start of segment                     | Linear address |
|-----------|-----------------|--------------------------------------|----------------|
| Segment 3 | 0x28C0          | 0x28C0:0000 (or 0x2143:0x77D0)       | 0x28C00        |
| Segment 2 | 0x2143          | 0x2143:0000                          | 0x21430        |
| Segment 1 | 0x0CEF          | 0x0CEF:0000                          | 0x0CEF0        |

FFFF:FFFF: weren't addresses 20 bits? The largest 20-bit address is FFFFF!
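The translation used for the segments above can be checked with a few lines of Python. This is only a sketch of the arithmetic (segment shifted left by 4 bits, plus offset); the function name `real_mode_linear` is made up for illustration.

```python
def real_mode_linear(segment: int, offset: int) -> int:
    """Linear address of segment:offset in Real Mode (no wrap-around applied)."""
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    return (segment << 4) + offset

# The two spellings of "start of segment 3" name the same byte.
print(hex(real_mode_linear(0x28C0, 0x0000)))   # 0x28c00
print(hex(real_mode_linear(0x2143, 0x77D0)))   # 0x28c00

# FFFF:FFFF does not fit in 20 bits: it needs a 21st address line.
top = real_mode_linear(0xFFFF, 0xFFFF)
print(hex(top), top > 0xFFFFF)                 # 0x10ffef True
```

Since different segment:offset pairs can name the same byte, segments overlap; and since the sum can exceed FFFFF, something has to happen to the extra bit.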



#### First Fetched Instruction

- The first fetched address is F000:FFF0
  - This is known as the *reset vector*
  - On IBM PCs this is mapped to a ROM: the BIOS
  - This leaves only 16 bytes before the top of the ROM address range, just enough for a far jump:

ljmp \$0xf000,\$0xe05b

The jump target, F000:E05B, is where the actual BIOS code begins



### **BIOS Operations**

- The BIOS first looks for video adapters that may need to load their own routines
  - These ROMs are mapped from C000:0000 to C780:0000
- Power-On Self-Test (POST)
  - Depends on the actual BIOS
  - Often involves testing devices (keyboard, mouse)
  - Video Card Initialization
  - RAM consistency check





### **BIOS Operations**

- Boot configuration loaded from CMOS (64 bytes)
  - For example, the boot order
- Shadow RAM initialization
  - The BIOS copies itself into RAM for faster access
- The BIOS tries to identify the Stage 1 bootloader (512 bytes) using the specified boot order, and loads it into memory at 0000:7c00
- Control is given with:

ljmp \$0x0000,\$0x7c00





### The RAM after the BIOS startup

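| Address range                               | Content                                                                                 |
|---------------------------------------------|-----------------------------------------------------------------------------------------|
| 0x000F0000 (960 KB) to 0x00100000 (1 MB)    | **BIOS ROM**                                                                            |
| 0x000C0000 (768 KB) to 0x000F0000 (960 KB)  | 16-bit devices, expansion ROM                                                           |
| 0x000A0000 (640 KB) to 0x000C0000 (768 KB)  | **VGA** Display                                                                         |
| 0x00000000 to 0x000A0000 (640 KB)           | Low Memory: the only available "RAM" in the early days; the bootloader is loaded here   |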










#### The Boot Sector

- The first sector of the device holds the so-called Master Boot Record (MBR)
- This sector holds executable code and a 4-entry partition table, which identifies the device partitions (in terms of their position on the device)
- If a partition is marked as extended, it can in turn hold up to 4 sub-partitions





## The Device Organization



- Boot sector: it can contain additional boot code
- Extended partition boot record





- The MBR implements the Stage 1 bootloader
- (Less than) 512 bytes are available for the code that loads the operating system

| Offset | Size (bytes)                             | Description                                     |
|--------|------------------------------------------|-------------------------------------------------|
| 0      | 436 (to 446, if you need a little extra) | MBR Bootstrap (flat binary executable code)     |
| 0x1b4  | 10                                       | Optional "unique" disk ID                       |
| 0x1be  | 64                                       | MBR Partition Table, with 4 entries (below)     |
| 0x1be  | 16                                       | First partition table entry                     |
| 0x1ce  | 16                                       | Second partition table entry                    |
| 0x1de  | 16                                       | Third partition table entry                     |
| 0x1ee  | 16                                       | Fourth partition table entry                    |
| 0x1fe  | 2                                        | (0x55, 0xAA) "Valid bootsector" signature bytes |
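The layout in the table can be exercised with a small parser. The sketch below checks the signature at 0x1fe and walks the four 16-byte entries at 0x1be. The field layout inside each entry (status byte, CHS start, type, CHS end, first LBA sector, sector count) is the classic one and is an assumption here, since the slides do not spell it out; `parse_mbr` and `make_example_mbr` are hypothetical helper names, and the example sector contains made-up values.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 0x1BE
SIGNATURE_OFFSET = 0x1FE

def parse_mbr(sector: bytes) -> list:
    """Return the non-empty partition entries found in an MBR sector."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("an MBR is exactly 512 bytes")
    if sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2] != b"\x55\xAA":
        raise ValueError("missing 0x55 0xAA boot signature")
    partitions = []
    for i in range(4):
        start = PARTITION_TABLE_OFFSET + 16 * i
        # Assumed entry layout: status, CHS start, type, CHS end, first LBA, count
        status, _chs_first, ptype, _chs_last, lba_first, count = struct.unpack(
            "<B3sB3sII", sector[start:start + 16])
        if ptype != 0:  # type 0 marks an unused entry
            partitions.append({"index": i, "bootable": status == 0x80,
                               "type": ptype, "lba_first": lba_first,
                               "sectors": count})
    return partitions

def make_example_mbr() -> bytes:
    """Build a synthetic sector (made-up values) with one bootable partition."""
    sector = bytearray(SECTOR_SIZE)
    entry = struct.pack("<B3sB3sII", 0x80, b"\x00\x00\x00", 0x83,
                        b"\x00\x00\x00", 2048, 204800)
    sector[PARTITION_TABLE_OFFSET:PARTITION_TABLE_OFFSET + 16] = entry
    sector[SIGNATURE_OFFSET:SIGNATURE_OFFSET + 2] = b"\x55\xAA"
    return bytes(sector)

print(parse_mbr(make_example_mbr()))
```

A BIOS performs essentially the signature check before jumping to the loaded sector; interpreting the partition table is left to the bootloader code itself.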





- The initial bytes of the MBR can contain the BIOS Parameter Block (BPB)
- It is a data structure describing the physical layout of a data storage volume
  - It is used, e.g., by FAT16, FAT32, and NTFS
- This eats up additional space, and must be placed *at the beginning* of the MBR!
  - How to execute the code?



```
.code16
.text
.globl start;

start:
    jmp .stage1_start              # Jump over the BPB fields below

OEMLabel:          .string "BOOT"
BytesPerSector:    .short 512
SectorsPerCluster: .byte 1
ReservedForBoot:   .short 1
NumberOfFats:      .byte 2
RootDirEntries:    .short 224
LogicalSectors:    .short 2880
MediumByte:        .byte 0x0F0
SectorsPerFat:     .short 9
SectorsPerTrack:   .short 18
Sides:             .short 2
HiddenSectors:     .int 0
LargeSectors:      .int 0
DriveNo:           .short 0
Signature:         .byte 41        # 41 = floppy
VolumeID:          .int 0x00000000
VolumeLabel:       .string "myOS"
FileSystem:        .string "FAT12"

.stage1_start:
    cli                            # Disable interrupts
    xorw %ax, %ax                  # Segment zero
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %ss
```



```
# Same boot sector as above; only the entry code is repeated here
start:
    jmp .stage1_start              # Jump over the BPB fields

.stage1_start:
    cli                            # Not safe here!
    xorw %ax, %ax                  # Segment zero
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %ss
    # What about CS?
```



### The Stage 1 Bootloader must...

- Enable address A20
- Switch to 32-bit protected mode
- Set up a stack
- Load the kernel
  - Yet, the kernel is on disk: how to navigate the file system? There is not much space for code...
  - Load the Stage 2 bootloader!





#### A20 Enable

- Intel 80286 increased the addressable memory to 16 MB (24 address lines)
- How to keep backward compatibility with 8086?
  - The "wrap-around" problem: on the 8086, addresses beyond 1 MB (up to FFFF:FFFF) wrapped around to low memory, and existing software relied on this
  - By default address line 20 is forced to zero!
- How to enable/disable this line?
  - Use the 8042 keyboard controller (sic!)
  - It had a spare pin, so the A20 line was routed through it
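The effect of forcing address line 20 to zero can be modelled in a few lines. This is a sketch of the arithmetic only, not of the hardware; `physical_address` is a hypothetical name.

```python
A20_BIT = 1 << 20

def physical_address(segment: int, offset: int, a20_enabled: bool) -> int:
    """Address placed on the bus for a Real Mode segment:offset pair."""
    linear = (segment << 4) + offset   # may need 21 bits
    if not a20_enabled:
        linear &= ~A20_BIT             # address line 20 forced to zero
    return linear

# FFFF:0010 is the first byte past 1 MB.
print(hex(physical_address(0xFFFF, 0x0010, a20_enabled=False)))  # 0x0
print(hex(physical_address(0xFFFF, 0x0010, a20_enabled=True)))   # 0x100000
```

With the line disabled the machine behaves like an 8086 (the access wraps to address 0); with it enabled, the same segment:offset pair reaches memory above 1 MB.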



#### A20 Enable

- The output port of the keyboard controller has a number of functions.
- Bit 1 is used to control A20:
  - 1 = enabled
  - 0 = disabled
- Port 0x64 is used to "communicate" an operation to the controller
  - 0xd1 means "write to the output port"
- 0xdf and 0xdd, when sent to port 0x60, enable and disable A20 respectively
  - You have to wait for previous operations to complete (the controller is slow)





#### A20 Enable

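```
# enable_a20 is an illustrative label: it lets the sequence end with a ret
# instead of falling through into the helper below
enable_a20:
    call wait_for_8042
    movb $0xd1, %al        # Command: write to the output port
    outb %al, $0x64
    call wait_for_8042
    movb $0xdf, %al        # Output port value with A20 enabled
    outb %al, $0x60
    call wait_for_8042
    ret

wait_for_8042:
    inb $0x64, %al         # Read the controller status register
    testb $2, %al          # Bit 1 (mask 0x2) set = busy
    jnz wait_for_8042
    ret
```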



#### x86 Protected Mode

- This execution mode was introduced in 80286 (1982)
- With 80386 (1985) it was extended by adding paging
- CPUs start in Real Mode for backwards compatibility
- Still today, x86 Protected Mode must be activated during system startup



### x86\_64 Registers





| Register class                          | Registers                                                                                              |
|-----------------------------------------|--------------------------------------------------------------------------------------------------------|
| General purpose (64/32/16/8-bit views)  | RAX/EAX/AX/AH/AL, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8 to R15 (e.g. R8/R8D/R8W/R8B)                   |
| Instruction pointer and flags           | RIP, RFLAGS/EFLAGS/FLAGS                                                                               |
| Segment                                 | CS, SS, DS, ES, FS, GS                                                                                 |
| Descriptor tables and task              | GDTR, IDTR, LDTR, TR                                                                                   |
| Control                                 | CR0 to CR15, MSW, MXCSR                                                                                |
| Debug                                   | DR0 to DR15                                                                                            |
| x87 FPU and MMX (80-bit / 64-bit)       | ST(0) to ST(7), aliased with MM0 to MM7; CW, SW, TW, FP\_IP, FP\_DP, FP\_CS, FP\_DS, FP\_OPC           |
| SIMD (512/256/128-bit)                  | ZMM0 to ZMM31, with YMM0 to YMM15 and XMM0 to XMM15 as their lower parts                               |





### CR0

| Bit | Name | Full Name             | Description                                                                                            |
|-----|------|-----------------------|--------------------------------------------------------------------------------------------------------|
| 0   | PE   | Protected Mode Enable | If 1, system is in protected mode, else system is in real mode                                         |
| 1   | MP   | Monitor co-processor  | Controls interaction of WAIT/FWAIT instructions with TS flag in CR0                                    |
| 2   | EM   | Emulation             | If set, no x87 FPU is present, if clear, x87 FPU is present                                            |
| 3   | TS   | Task switched         | Allows saving x87 task context upon a task switch only after x87 instruction used                      |
| 4   | ET   | Extension type        | On the 386, it allowed specifying whether the external math coprocessor was an 80287 or 80387          |
| 5   | NE   | Numeric error         | Enable internal x87 floating point error reporting when set, else enables PC style x87 error detection |
| 16  | WP   | Write protect         | When set, the CPU can't write to read-only pages when privilege level is 0                             |
| 18  | AM   | Alignment mask        | Alignment check enabled if AM set, AC flag (in EFLAGS register) set, and privilege level is 3          |
| 29  | NW   | Not-write through     | Globally enables/disables write-through caching                                                        |
| 30  | CD   | Cache disable         | Globally enables/disables the memory cache                                                             |
| 31  | PG   | Paging                | If 1, enable paging and use the CR3 register, else disable paging                                      |
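To make the bit positions concrete, the table can be turned into a small decoder. This is only an illustrative sketch: `decode_cr0` and `set_pe` are made-up names, and 0x80050033 is just a sample value chosen to have several flags set.

```python
# Bit position -> flag name, as listed in the CR0 table.
CR0_FLAGS = {0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
             16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG"}

def decode_cr0(value: int) -> list:
    """Names of the CR0 flags set in value, from bit 0 upward."""
    return [name for bit, name in sorted(CR0_FLAGS.items())
            if value & (1 << bit)]

def set_pe(value: int) -> int:
    """CR0 value with the Protected Mode Enable bit (bit 0) set."""
    return value | 1

print(decode_cr0(0x80050033))
# ['PE', 'MP', 'ET', 'NE', 'WP', 'AM', 'PG']
print(decode_cr0(set_pe(0x00000010)))
# ['PE', 'ET']
```

On real hardware the same read-modify-write is done with `mov` from and to `%cr0`; the Python version only shows which bit changes.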





## **Entering Basic Protected Mode**

- The code must set bit 0 (PE) of register CR0
- Setting PE to 1 does not immediately activate all the facilities of Protected Mode
- They become active when the CS register is first updated
- This can be done only with a far jump (ljmp) instruction, as already mentioned
- After this, code executes in 32-bit mode (64-bit execution additionally requires enabling long mode)



### **Entering Basic Protected Mode**

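```
    # PROT_MODE_CSEG: code-segment selector, assumed to be defined next to
    # PROT_MODE_DSEG (a null selector such as 0x0000 would fault here)
    ljmp $PROT_MODE_CSEG, $PE_mode

.code32
PE_mode:
    # Set up the protected-mode data segment registers
    movw $PROT_MODE_DSEG, %ax
    movw %ax, %ds
    movw %ax, %es
    movw %ax, %fs
    movw %ax, %gs
    movw %ax, %ss
```
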

#### Segment Registers in Protected Mode

- In Protected Mode, the value in a segment register is no longer a raw address
- It holds (among other fields) an index into a table of segment descriptors
- There are three types of segments:
  - code
  - data
  - system





### Descriptor Table Entry



- **Base**: 32-bit linear address pointing to the beginning of the segment
- **Limit**: size of the segment
- **G**: *Granularity*. If set, the limit is expressed in units of 4096 bytes
- **Descriptor Privilege Level** (DPL): a number in [0-3] controlling access to the segment
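These fields are scattered across the 8-byte descriptor, so a decoder helps to see how they fit together. The sketch below assumes the standard x86 descriptor layout (limit in bits 0-15 and 48-51, base in bits 16-39 and 56-63, DPL in bits 45-46, G in bit 55), which the slide's figure presumably showed; `decode_descriptor` is a hypothetical name and the two raw values are illustrative.

```python
def decode_descriptor(raw: int) -> dict:
    """Extract Base, Limit, G and DPL from a 64-bit segment descriptor."""
    limit = (raw & 0xFFFF) | (((raw >> 48) & 0xF) << 16)
    base = ((raw >> 16) & 0xFFFFFF) | (((raw >> 56) & 0xFF) << 24)
    dpl = (raw >> 45) & 0x3
    granular = bool((raw >> 55) & 1)
    # With G set the limit counts 4096-byte units instead of bytes.
    size = (limit + 1) * (4096 if granular else 1)
    return {"base": base, "limit": limit, "G": granular,
            "DPL": dpl, "size": size}

# A flat ring-0 code segment: base 0, limit 0xFFFFF, G=1, i.e. 4 GB.
flat = decode_descriptor(0x00CF9A000000FFFF)
print(hex(flat["base"]), hex(flat["limit"]), flat["G"], flat["DPL"])
print(flat["size"] == 2**32)   # True

# A byte-granular segment: base 0x12345678, limit 0xEB, i.e. 236 bytes.
small = decode_descriptor(0x12008934567800EB)
print(hex(small["base"]), small["size"])   # 0x12345678 236
```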





### Protected Mode: Privilege Levels



Ring 3 has restricted access to memory management, instruction execution (around 15 instructions are allowed only at ring 0), and I/O ports





### Descriptor Tables

- Two tables are available on x86 architectures
- Global Descriptor Table (GDT):
  - This is a system-wide table of descriptors
  - It is pointed to by the GDTR register
- Local Descriptor Table (LDT):
  - Pointed to by the LDTR register
  - Rarely used nowadays





# Segment Selectors



- **TI**: set to 0 for the GDT, set to 1 for the LDT
- **Index**: selects the descriptor entry within the chosen table
- **Requested Privilege Level** (RPL): we'll come to that later
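A selector is a 16-bit value with the RPL in bits 0-1, TI in bit 2, and the index in the remaining 13 bits. The following sketch splits one apart; `decode_selector` is a hypothetical name, and 0x60 and 0x73 are simply sample values.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit segment selector into Index, TI and RPL."""
    return {
        "index": selector >> 3,                       # descriptor entry
        "table": "LDT" if selector & 0x4 else "GDT",  # TI bit
        "rpl": selector & 0x3,                        # requested privilege
    }

print(decode_selector(0x60))   # {'index': 12, 'table': 'GDT', 'rpl': 0}
print(decode_selector(0x73))   # {'index': 14, 'table': 'GDT', 'rpl': 3}
```

Since each descriptor is 8 bytes, `index * 8` is also the byte offset of the entry inside its table, which is why selectors with RPL 0 look like multiples of 8.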





# Segmented Addressing Resolution














# Segment Caching

- Accessing the GDT for every memory access would be too costly in terms of performance
- Segment registers have a non-programmable hidden part to store the cached descriptor







# x86 Enforcing Protection

- A Descriptor Entry has a DPL
- The CPU (in microcode) must check whether an access to a certain segment is allowed
- There must be a way to change current privilege





# Data Segment vs Code Segment

- RPL is present only in data segment selectors (e.g. SS or DS)
- Current Privilege Level (CPL): this is only in CS, which can be loaded only with an ljmp/lcall

Overall we have 3 different privilege-level fields: CPL, RPL, and DPL





# Protection upon Segment Load

- CPL is managed by the CPU: it is *always* equal to the current CPU privilege level

- CPU Memory protection comes at two points:
  - Memory access via a linear address
  - Data segment selector load operation
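The check performed when a data segment selector is loaded is commonly stated as: the load is allowed only if max(CPL, RPL) <= DPL. The sketch below encodes just that rule, under the assumption that it is the one intended here; it does not model the stricter rule for SS, and `can_load_data_segment` is a hypothetical name.

```python
def can_load_data_segment(cpl: int, rpl: int, dpl: int) -> bool:
    """True if a selector with this RPL may be loaded into DS/ES/FS/GS.

    Lower numbers mean more privilege, so the weaker of CPL and RPL
    (the numerically larger one) must still be allowed by the DPL.
    """
    return max(cpl, rpl) <= dpl

print(can_load_data_segment(cpl=3, rpl=3, dpl=3))  # True: user data from user code
print(can_load_data_segment(cpl=3, rpl=0, dpl=0))  # False: a low RPL does not help ring 3
print(can_load_data_segment(cpl=0, rpl=3, dpl=0))  # False: ring 0 acting for a ring-3 caller
```

The last case shows what RPL is for: privileged code that handles a selector on behalf of less privileged code can carry the requester's level along with it.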












# Getting Higher Privileges

- Accessing a segment with a higher privilege (lower ring) without any control might allow malicious code to subvert the kernel
- To transfer control, code must pass through a controlled gate
- **Gate descriptors** are used to identify the possible gates through which control can pass





# Controlled Access Through Gates

Figure: Kernel Space (Ring 0) and User Space (Ring 3); a direct cross-segment jump from user space into kernel space is not admitted.





### Gate Descriptors

- A gate descriptor is a segment descriptor of type *system*:
  - Call-gate descriptors
  - Interrupt-gate descriptors
  - Trap-gate descriptors
  - Task-gate descriptors
- These are referenced by the Interrupt Descriptor Table (IDT), pointed to by the IDTR register





#### IDT and GDT Relations





#### GDT in Linux

| Linux's GDT | Segment Selectors      | Linux's GDT         | Segment Selectors               |
|-------------|------------------------|---------------------|---------------------------------|
| null        | 0x0                    | TSS                 | 0x80 (different for each core)  |
| reserved    |                        | LDT                 | 0x88 (shared across all cores)  |
| reserved    |                        | PNPBIOS 32-bit code | 0x90                            |
| reserved    |                        | PNPBIOS 16-bit code | 0x98                            |
| not used    |                        | PNPBIOS 16-bit data | 0xa0                            |
| not used    |                        | PNPBIOS 16-bit data | 0xa8                            |
| TLS #1      | 0x33                   | PNPBIOS 16-bit data | 0xb0                            |
| TLS #2      | 0x3b                   | APMBIOS 32-bit code | 0xb8                            |
| TLS #3      | 0x43                   | APMBIOS 16-bit code | 0xc0                            |
| reserved    |                        | APMBIOS data        | 0xc8                            |
| reserved    |                        | not used            |                                 |
| reserved    |                        | not used            |                                 |
| kernel code | 0x60 (\_\_KERNEL\_CS)  | not used            |                                 |
| kernel data | 0x68 (\_\_KERNEL\_DS)  | not used            |                                 |
| user code   | 0x73 (\_\_USER\_CS)    | not used            |                                 |
| user data   | 0x7b (\_\_USER\_DS)    | double fault TSS    | 0xf8                            |

There is one copy of this table for each core





- The Task State Segment (TSS) is a structure keeping information about a task
- It is intended to handle task management
- It stores:
  - Processor registers state
  - I/O Port Permissions
  - Inner-level Stack Pointers
  - Previous TSS link





- It can be placed anywhere in memory (hence the GDT entry required to access it)
- On Linux, it's in kernel data memory
- Each TSS is stored in the init\_tss array.
- The selector is kept in the Task Register (TR)
- It can be loaded using the privileged ltr instruction (CPL = 0)









- The *Base* field of the TSS descriptor of the *n*-th core points to the *n*-th entry of the init\_tss array
- G=0 and Limit=0xeb
  - given that the TSS is 236 bytes in size
- *DPL*=0
  - The TSS cannot be accessed in user mode





#### TSS on x64

| Offset | Field                           |
|--------|---------------------------------|
| 0x66   | I/O Map Base Address (16 bits)  |
| 0x5C   | Reserved (through 0x65)         |
| 0x54   | IST7                            |
| 0x4C   | IST6                            |
| 0x44   | IST5                            |
| 0x3C   | IST4                            |
| 0x34   | IST3                            |
| 0x2C   | IST2                            |
| 0x24   | IST1                            |
| 0x1C   | Reserved                        |
| 0x14   | RSP2                            |
| 0x0C   | RSP1                            |
| 0x04   | RSP0                            |
| 0x00   | Reserved                        |

Each RSPn and ISTn field is 64 bits wide, stored as a low and a high 32-bit half.

- The saved general-purpose registers are gone.
- The Interrupt Stack Table (IST) identifies 7 stack pointers to handle interrupts
- Entries in the IDT are modified to allow picking one of these stacks
- An IST index of 0 in an IDT entry tells the CPU not to use the IST mechanism





# Entering Ring 0 from Ring 3







# Protected Mode Paging

- Since the 80386, x86 CPUs add a further step to address translation

Memory Address Translation







# Protected Mode Paging

- Paging has to be explicitly enabled
  - Entering Protected Mode does not enable it automatically
  - Several data structures must be set up beforehand
- Paging allows managing memory protection at a finer granularity than segmentation



# i386 Paging Scheme








- Both levels are based on 4 KB memory blocks
- Each block is an array of 4-byte entries
- Hence we can map 1K x 1K pages
- Since each page is 4 KB in size, we get a 4 GB virtual addressing space
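The arithmetic above corresponds to splitting a 32-bit virtual address into a 10-bit page-directory index, a 10-bit page-table index, and a 12-bit offset. A minimal sketch follows; `split_i386` is a hypothetical name and the address is just an example.

```python
def split_i386(vaddr: int) -> tuple:
    """(page-directory index, page-table index, offset) of a 32-bit address."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF

pd_index, pt_index, offset = split_i386(0xC0101234)
print(pd_index, pt_index, hex(offset))   # 768 257 0x234

# 1K directory entries x 1K table entries x 4 KB pages = 4 GB
print(1024 * 1024 * 4096 == 2**32)       # True
```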



#### i386 PDE entries

#### Page-Directory Entry (4-KByte Page Table)

| Bits  | Field                   | Meaning                                |
|-------|-------------------------|----------------------------------------|
| 31-12 | Page-Table Base Address | Physical address of the page table     |
| 11-9  | Avail                   | Available for system programmer's use  |
| 8     | G                       | Global page (Ignored)                  |
| 7     | PS                      | Page size (0 indicates 4 KBytes)       |
| 6     | 0                       | Reserved (set to 0)                    |
| 5     | A                       | Accessed                               |
| 4     | PCD                     | Cache Disabled                         |
| 3     | PWT                     | Write-Through                          |
| 2     | U/S                     | User/Supervisor                        |
| 1     | R/W                     | Read/Write                             |
| 0     | P                       | Present                                |



#### i386 PTE entries

#### Page-Table Entry (4-KByte Page)

| Bits  | Field             | Meaning                                |
|-------|-------------------|----------------------------------------|
| 31-12 | Page Base Address | Physical address of the 4 KB page      |
| 11-9  | Avail             | Available for system programmer's use  |
| 8     | G                 | Global Page (TLB caching policy)       |
| 7     | PAT               | Page Table Attribute Index             |
| 6     | D                 | Dirty                                  |
| 5     | A                 | Accessed (Sticky bit)                  |
| 4     | PCD               | Cache Disabled                         |
| 3     | PWT               | Write-Through                          |
| 2     | U/S               | User/Supervisor                        |
| 1     | R/W               | Read/Write                             |
| 0     | P                 | Present                                |
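As a quick illustration of the entry format, the sketch below separates the page base address from the flag bits of a page-table entry. `decode_pte` is a hypothetical name and 0x0009F067 is a made-up entry value.

```python
# Bit position -> flag name for a 4 KB page-table entry.
PTE_FLAGS = {0: "P", 1: "R/W", 2: "U/S", 3: "PWT", 4: "PCD",
             5: "A", 6: "D", 7: "PAT", 8: "G"}

def decode_pte(entry: int) -> dict:
    """Split a 32-bit PTE into page base address, Avail bits and flags."""
    return {
        "page_base": entry & 0xFFFFF000,   # bits 31-12
        "avail": (entry >> 9) & 0x7,       # bits 11-9
        "flags": [name for bit, name in sorted(PTE_FLAGS.items())
                  if entry & (1 << bit)],
    }

info = decode_pte(0x0009F067)
print(hex(info["page_base"]), info["flags"])
# 0x9f000 ['P', 'R/W', 'U/S', 'A', 'D']
```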





## Virtual to Physical Translation

**Memory Address Translation** 





## Paging Unit Operations

- Upon a TLB miss, the CPU walks the page table in hardware
- The first bit checked is PRESENT
- If this bit is zero, a page fault occurs, which gives rise to a trap
- CPU registers (including EIP and CS) are saved on the system stack
- They are restored when returning from the trap: the trapped instruction is re-executed
- Re-execution might give rise to additional traps, depending on the CPU's checks on the page table
- As an example, an attempt to write to a read-only page gives rise to a trap (which triggers the **segmentation fault handler**)
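The sequence of checks can be mimicked with a toy model of a two-level i386 page walk. This is a simplified sketch, not how any real kernel or CPU is implemented: physical memory is a Python dict from entry address to 32-bit value, only the PRESENT and R/W bits are modelled, and a write to a read-only page always faults (as if CR0.WP were set). `translate`, `PageFault`, and the memory contents are all made up for illustration.

```python
PRESENT, WRITABLE = 0x1, 0x2

class PageFault(Exception):
    """Raised where the CPU would trap into the page fault handler."""

def translate(memory: dict, cr3: int, vaddr: int, write: bool = False) -> int:
    """Walk page directory and page table; return the physical address."""
    pd_index = (vaddr >> 22) & 0x3FF
    pt_index = (vaddr >> 12) & 0x3FF
    pde = memory.get((cr3 & 0xFFFFF000) + 4 * pd_index, 0)
    if not pde & PRESENT:                      # first check: PRESENT
        raise PageFault("page table not present")
    pte = memory.get((pde & 0xFFFFF000) + 4 * pt_index, 0)
    if not pte & PRESENT:
        raise PageFault("page not present")
    if write and not (pde & pte & WRITABLE):   # both levels must allow writes
        raise PageFault("write to read-only page")
    return (pte & 0xFFFFF000) | (vaddr & 0xFFF)

# Page directory at 0x1000, one page table at 0x2000, one read-only page
# at 0x5000 mapped at virtual address 0xC0101000.
memory = {
    0x1000 + 4 * 768: 0x2000 | PRESENT | WRITABLE,
    0x2000 + 4 * 257: 0x5000 | PRESENT,
}
print(hex(translate(memory, 0x1000, 0xC0101234)))   # 0x5234
try:
    translate(memory, 0x1000, 0xC0101234, write=True)
except PageFault as fault:
    print("trap:", fault)                            # trap: write to read-only page
```

After the handler fixes the page table (for example by marking the page present), retrying `translate` corresponds to re-executing the trapped instruction.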



#### Translation Lookaside Buffer







### Linux memory layout on i386







## Physical Address Extension (PAE)

- An extension to overcome the 4 GB limit on i386 systems
- Present since the Intel Pentium Pro
- Supported on Linux since kernel 2.6
- Physical addressing is extended to 36 bits
- This allows addressing up to 64 GB of RAM
- Paging uses 3 levels
- The PAE bit of CR4 (bit 5) tells whether PAE is enabled












## x64 Paging Scheme

- PAE is extended via the so-called "long addressing"
- 2<sup>64</sup> bytes of logical memory in theory
- Bits 48-63 (counting from 0) are short-circuited: they must replicate bit 47
  - Up to 2<sup>48</sup> canonical-form addresses (lower and upper half)
  - A total of 256 TB of addressable memory
- Linux currently allows for 128 TB of logical addressing of individual processes and 64 TB for physical addressing
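Whether an address is in canonical form can be checked by looking at bit 47 and everything above it: these 17 bits must be all zeros (lower half) or all ones (upper half). A small sketch, with `is_canonical` as a hypothetical name:

```python
def is_canonical(addr: int, bits: int = 48) -> bool:
    """True if the 64-bit address is a sign extension of its low `bits` bits."""
    top = addr >> (bits - 1)               # bit 47 and all the bits above it
    all_ones = (1 << (64 - bits + 1)) - 1
    return top == 0 or top == all_ones

print(is_canonical(0x00007FFFFFFFFFFF))   # True:  top of the lower half
print(is_canonical(0xFFFF800000000000))   # True:  bottom of the upper half
print(is_canonical(0x0000800000000000))   # False: inside the hole
print(2**48 == 256 * 2**40)               # True:  256 TB in total
```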





#### Canonical Addresses

64-bit



**48-bit** 





## Linux memory layout on x64







## 48-bit Page Table (4KB pages)
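With 4 KB pages, a 48-bit virtual address is split into four 9-bit table indices (PML4, PDPT, PD, PT) and a 12-bit offset. The sketch below performs the split; `split_x64` is a hypothetical name and the addresses are examples.

```python
def split_x64(vaddr: int) -> dict:
    """Table indices and page offset of a 48-bit virtual address (4 KB pages)."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }

print(split_x64(0x00007FFFFFFFFFFF))
# {'pml4': 255, 'pdpt': 511, 'pd': 511, 'pt': 511, 'offset': 4095}
print(split_x64(0xFFFF800000000000)["pml4"])   # 256
```

The lower canonical half uses PML4 entries 0 to 255 and the upper half uses entries 256 to 511, so the two halves can be given to different owners one top-level entry at a time.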





## CR3 and Paging Structure Entries

| Structure             | P (bit 0) | Address field                            | Other fields                                   |
|-----------------------|-----------|------------------------------------------|------------------------------------------------|
| CR3                   |           | Address of PML4 table                    | PCD, PWT; remaining bits ignored or reserved   |
| PML4E: present        | 1         | Address of page-directory-pointer table  | XD, A, PCD, PWT, U/S, R/W                      |
| PML4E: not present    | 0         | Ignored                                  |                                                |
| PDPTE: 1GB page       | 1         | Address of 1GB page frame                | XD, PAT, G, PS=1, D, A, PCD, PWT, U/S, R/W     |
| PDPTE: page directory | 1         | Address of page directory                | XD, PS=0, A, PCD, PWT, U/S, R/W                |
| PDPTE: not present    | 0         | Ignored                                  |                                                |
| PDE: 2MB page         | 1         | Address of 2MB page frame                | XD, PAT, G, PS=1, D, A, PCD, PWT, U/S, R/W     |
| PDE: page table       | 1         | Address of page table                    | XD, PS=0, A, PCD, PWT, U/S, R/W                |
| PDE: not present      | 0         | Ignored                                  |                                                |
| PTE: 4KB page         | 1         | Address of 4KB page frame                | XD, G, PAT, D, A, PCD, PWT, U/S, R/W           |
| PTE: not present      | 0         | Ignored                                  |                                                |

XD is bit 63; the address field ends at bit M-1, where M is the physical address width, and bits M to 51 are reserved.





## Huge Pages

- In principle, x64 processors support them starting from the PDPT (1 GB pages)
- Linux typically supports huge pages pointed to by a PDE (page size 512 \* 4 KB = 2 MB)
- See: /proc/meminfo and /proc/sys/vm/nr\_hugepages
- These can be mmap'ed via file descriptors and/or mmap parameters (e.g., the MAP\_HUGETLB flag)
- They can also be requested via the madvise(void \*, size\_t, int) system call (with the MADV\_HUGEPAGE flag)
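
The two request paths above can be sketched in C as follows. This is a minimal sketch, assuming a Linux system with 2 MB huge pages; the names `alloc_huge` and `HUGE_PAGE_SIZE` are hypothetical and only serve the illustration. The explicit `MAP_HUGETLB` mapping typically fails when no huge pages have been reserved through nr\_hugepages, so the sketch falls back to an ordinary mapping with a `MADV_HUGEPAGE` hint.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes MAP_HUGETLB and MADV_HUGEPAGE */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: 2 MB huge pages (512 * 4 KB), i.e. pages pointed to by a PDE. */
#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper: returns a readable/writable anonymous mapping of
 * `len` bytes, or NULL on failure. *used_hugetlb is set to 1 when the
 * explicit huge-page mapping succeeded, 0 when the fallback was taken.
 */
void *alloc_huge(size_t len, int *used_hugetlb)
{
    /* First attempt: explicit huge pages taken from the reserved pool. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_hugetlb = 1;
        return p;
    }

    /* Fallback: ordinary pages plus a transparent-huge-page hint. */
    *used_hugetlb = 0;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The hint is advisory: the kernel may ignore it, so errors are not fatal. */
    (void)madvise(p, len, MADV_HUGEPAGE);
    return p;
}
```

A caller would request a multiple of `HUGE_PAGE_SIZE`, use the memory, and release it with `munmap`. Whether huge pages actually back the mapping can be checked in /proc/meminfo.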



## How to enable x64 Long Mode

- The first step is (of course) to set up a coherent page table
- We must then tell the CPU to enable Long Mode
- Refer to arch/x86/include/uapi/asm/msr-index.h for the definition of the symbols

```
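/* Set the Long Mode Enable (LME) bit in the EFER model-specific register */
movl $MSR_EFER, %ecx
rdmsr
btsl $_EFER_LME, %eax
wrmsr
/* Build a far-return frame: code segment selector and 64-bit entry point */
pushl $__KERNEL_CS
leal startup_64(%ebp), %eax
pushl %eax
/* Enable paging and protected mode: this activates Long Mode */
movl $(X86_CR0_PG | X86_CR0_PE), %eax
movl %eax, %cr0
/* Far return into the 64-bit code segment */
lret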
```
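
To make the bit manipulation in the listing concrete, here is a small C sketch of the values involved. The numeric constants are the ones the Linux headers assign to `MSR_EFER`, `_EFER_LME`, `X86_CR0_PE` and `X86_CR0_PG`; the macro and function names below are hypothetical and chosen only for this illustration. The code computes values only: it does not touch any privileged register.

```c
#include <stdint.h>

/* Illustrative names; values match the corresponding Linux symbols. */
#define EFER_MSR_NUMBER 0xC0000080u  /* MSR_EFER: extended feature enable register */
#define EFER_LME_BIT    8            /* _EFER_LME: Long Mode Enable bit number */
#define CR0_PE          (1u << 0)    /* X86_CR0_PE: protection enable */
#define CR0_PG          (1u << 31)   /* X86_CR0_PG: paging */

/* Mirrors `btsl $_EFER_LME, %eax`: sets bit 8 in the low half of EFER. */
uint32_t efer_low_with_lme(uint32_t efer_low)
{
    return efer_low | (1u << EFER_LME_BIT);
}

/* Mirrors `movl $(X86_CR0_PG | X86_CR0_PE), %eax`. */
uint32_t cr0_for_long_mode(void)
{
    return CR0_PG | CR0_PE;
}
```

With these definitions, `efer_low_with_lme(0)` yields `0x100` and `cr0_for_long_mode()` yields `0x80000001`.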










## Second Stage Bootloader

- There are various versions of this software
  - In GRUB it is GRUB Stage 2
  - In Win NT it is c:\ntldr
- The second stage bootloader reads a configuration file, e.g., to display a boot selection menu
  - grub.conf in GRUB, boot.ini in Win NT
- The initial kernel image is loaded into memory using BIOS disk I/O services
  - For Linux, it is /boot/vmlinuz-\*
  - For Win NT, it is c:\Windows\System32\ntoskrnl.exe



#### Historical Linux Bootcode

- The historical boot sector code for Linux (i386) is in arch/i386/boot/bootsect.S (no longer used)
- It loaded arch/i386/boot/setup.S and the kernel image into memory
- The code in arch/i386/boot/setup.S initialized the architecture (e.g., the CPU state for the actual kernel boot)
- It ultimately gave control to the initial kernel image





# Unified Extensible Firmware Interface (UEFI)

- Modular (you can extend it with drivers)
- Runs on various platforms
- It's written in C
- It supports a bytecode (EFI Byte Code) for portability to other architectures
- It's completely different from BIOS





#### **UEFI** Boot

- UEFI boot manager takes control right after the system is powered on
- It looks at the boot configuration
- It loads the firmware settings from NVRAM into RAM
- Startup files are stored on a dedicated EFI System Partition (ESP)
  - It's a FAT32 partition
  - It has one folder for each OS on the system
- MBR cannot handle disks larger than 2 TB, so UEFI systems typically rely on the GUID Partition Table (GPT)
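
The 2 TB figure follows from the 32-bit sector addresses stored in MBR partition entries. A minimal sketch of the arithmetic, assuming the classic 512-byte sector size (the function name `mbr_max_bytes` is hypothetical):

```c
#include <stdint.h>

/*
 * MBR partition entries store the starting sector and the sector count
 * in 32-bit fields, so at most 2^32 sectors can be addressed.
 */
uint64_t mbr_max_bytes(uint32_t sector_size)
{
    return ((uint64_t)1 << 32) * sector_size;
}
```

For 512-byte sectors this gives 2^32 \* 512 = 2^41 bytes, i.e. 2 TiB.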





#### **UEFI** Boot

- It can automatically detect new UEFI boot targets
  - UEFI uses standard path names
    - /efi/boot/bootx64.efi
    - /efi/boot/bootaa64.efi
- UEFI programs can be easily written





#### **GUID Partition Table**





#### Secure Boot

- Some malware takes control of the system before the OS starts
  - MBR rootkits
- Usually, these rootkits hijack the IDT entries used for I/O operations, so that their own wrapper is executed
- When the kernel is being loaded, the rootkit notices it and patches the binary code while it is loaded into RAM





#### Secure Boot

- With Secure Boot, UEFI loads only signed executables
- The keys used to verify signatures are installed in the UEFI configuration
  - Platform Key (PK): tells who "owns and controls" the hardware platform
  - Key-Exchange Keys (KEK): tell who is allowed to update the hardware platform
  - Signature Database Keys (DB): tell who is allowed to boot the platform in secure mode





## Dealing with multicores

- Who shall execute the startup code?
- For legacy reasons, the code is purely sequential
- Only one CPU core (the master) should run the code

- At startup, only one core is active, the others are in an idle state
- The startup procedure has to wake up other cores during kernel startup





### Interrupts on Multicore Architectures

- The Advanced Programmable Interrupt Controller (APIC) is used for sophisticated interrupt sending/redirection
- Each core has a Local APIC (LAPIC) controller, which can send Inter-Processor Interrupts (IPIs)
  - LAPICs are connected through the (logical) "APIC Bus"
  - LINT 0: normal interrupts
  - LINT 1: Non-Maskable Interrupts
- Each I/O APIC contains a redirection table, which is used to route the interrupts it receives from peripheral buses to one or more Local APICs





#### **LAPIC**







## Interrupt Control Register

- The ICR register is used to initiate an IPI
- Values written into it specify the type of interrupt to be sent, and the target core

ICR (upper 32-bits)



The Destination Field (8-bits) can be used to specify which processor (or group of processors) will receive the message

Memory-Mapped Register-Address: 0xFEE00310

#### ICR (lower 32-bits)
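
As a rough illustration of how the ICR fields combine into the values written to the register, the following C sketch packs the fields used for broadcast IPIs. The bit positions follow the xAPIC register layout; the enum and function names are hypothetical, and fields not needed here (destination mode, trigger mode) are left at zero.

```c
#include <stdint.h>

/* Delivery mode, bits 10:8 of the lower ICR doubleword. */
enum icr_delivery_mode { ICR_FIXED = 0, ICR_INIT = 5, ICR_STARTUP = 6 };

/* Destination shorthand, bits 19:18 of the lower ICR doubleword. */
enum icr_shorthand {
    ICR_NO_SHORTHAND  = 0,
    ICR_SELF          = 1,
    ICR_ALL_INCL_SELF = 2,
    ICR_ALL_EXCL_SELF = 3
};

/* Packs the lower 32 bits of the ICR. */
uint32_t icr_low(uint8_t vector, enum icr_delivery_mode mode,
                 int level_assert, enum icr_shorthand shorthand)
{
    return (uint32_t)vector                        /* bits 7:0:   vector */
         | ((uint32_t)mode << 8)                   /* bits 10:8:  delivery mode */
         | ((uint32_t)(level_assert != 0) << 14)   /* bit 14:     level (1 = assert) */
         | ((uint32_t)shorthand << 18);            /* bits 19:18: destination shorthand */
}

/* Packs the upper 32 bits: the 8-bit Destination Field sits in bits 31:24. */
uint32_t icr_high(uint8_t destination)
{
    return (uint32_t)destination << 24;
}
```

For example, `icr_low(0, ICR_INIT, 1, ICR_ALL_EXCL_SELF)` evaluates to `0x000C4500`, and `icr_low(0x11, ICR_STARTUP, 1, ICR_ALL_EXCL_SELF)` to `0x000C4611`.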







## Broadcast INIT-SIPI-SIPI Sequence

```
# address Local-APIC via register FS
    mov $sel_fs, %ax
    mov %ax, %fs
# broadcast 'INIT' IPI to 'all-except-self'
    mov $0x000C4500, %eax   # 11 00 0 1 0 0 0 101 00000000
    mov %eax, %fs:(0xFEE00300)
.B0: btl $12, %fs:(0xFEE00300)
    jc .B0
# broadcast 'Startup' IPI to 'all-except-self'
# using vector 0x11 to specify entry-point
# at real memory-address 0x00011000
    mov $0x000C4611, %eax   # 11 00 0 1 0 0 0 110 00010001
    mov %eax, %fs:(0xFEE00300)
.B1: btl $12, %fs:(0xFEE00300)
    jc .B1
```
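
Two details of the listing can be restated in C: the Startup IPI vector selects a 4 KB-aligned real-mode entry point, and bit 12 of the lower ICR doubleword (Delivery Status) is polled until the IPI has been sent. The sketch below only computes values; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define ICR_DELIVERY_STATUS (1u << 12)  /* the bit tested by `btl $12` */

/* A Startup IPI with vector 0xVV starts the target cores at physical 0x000VV000. */
uint32_t sipi_entry_address(uint8_t vector)
{
    return (uint32_t)vector << 12;
}

/*
 * Inverse mapping: the entry point must be 4 KB aligned and below 1 MB.
 * Returns 0 and stores the vector on success, -1 otherwise.
 */
int sipi_vector_for(uint32_t address, uint8_t *vector)
{
    if ((address & 0xFFFu) != 0 || address >= 0x100000u)
        return -1;
    *vector = (uint8_t)(address >> 12);
    return 0;
}

/* Non-zero while the Local APIC reports the IPI as still pending. */
int icr_send_pending(uint32_t icr_low_value)
{
    return (icr_low_value & ICR_DELIVERY_STATUS) != 0;
}
```

With vector `0x11`, `sipi_entry_address` returns `0x00011000`, which matches the entry point named in the comments of the listing.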

