

## System architecture

Jakub Yaghob, Martin Kruliš





### Terminology – 1

- Distributed systems
  - Host/node/system
    - One computer
  - Cluster
    - A set of nodes connected together by a network
    - Usually homogeneous
  - Grid
    - A set of clusters connected via the Internet



### Terminology – 2

- One system
  - Socket/package
    - Physical CPU
    - Multiple sockets connected together by a high-speed connection
    - Cache hierarchy
    - Cores
  - NUMA node
    - Main memory
    - Can contain processing units
  - Core
    - Owns execution units
    - Shared registers
    - Contains logical CPUs
  - Logical CPU/thread
    - Processing unit
    - Executes instructions
    - Private registers
    - Hyper-threading
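The logical CPU is the unit the operating system actually schedules on, so it is the number a program usually sees first. A minimal sketch of querying it from Python follows; the helper name `logical_cpu_summary` is hypothetical, and `os.sched_getaffinity` is only available on some platforms (for example Linux), so the code falls back to the total count elsewhere.

```python
import os

def logical_cpu_summary():
    """Return (logical CPUs in the system, logical CPUs usable by this process)."""
    total = os.cpu_count() or 1  # cpu_count() may return None
    # sched_getaffinity is platform-specific; fall back to the total if missing.
    if hasattr(os, "sched_getaffinity"):
        usable = len(os.sched_getaffinity(0))
    else:
        usable = total
    return total, usable

total, usable = logical_cpu_summary()
print(f"logical CPUs: {total}, usable by this process: {usable}")
```

Note that this count includes hyper-threads: two logical CPUs on the same core share that core's execution units, so the number of cores may be lower than the number reported here.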



### NUMA system – 4S





### NUMA system – 8S



### NUMA – physical memory layout



#### Block

Figure labels: NODE0, NODE1, NODE2, NODE3 (each node owns one contiguous block of the physical address space).

#### Interleaving
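The difference between the block and interleaved layouts can be sketched as two address-to-node mappings. This is an illustrative model only: the function names, the node size, and the 4 KiB interleaving granularity are assumptions, not values taken from the slides or from any particular platform.

```python
def node_block(addr, node_size, num_nodes):
    """Block layout: each node owns one contiguous range of physical addresses."""
    node = addr // node_size
    if not 0 <= node < num_nodes:
        raise ValueError("address outside physical memory")
    return node

def node_interleaved(addr, granularity, num_nodes):
    """Interleaved layout: consecutive chunks rotate round-robin over the nodes."""
    return (addr // granularity) % num_nodes

GiB = 1 << 30
# Hypothetical machine: 4 nodes with 16 GiB each, 4 KiB interleaving granularity.
print([node_block(i * 16 * GiB, 16 * GiB, 4) for i in range(4)])  # [0, 1, 2, 3]
print([node_interleaved(i * 4096, 4096, 4) for i in range(6)])    # [0, 1, 2, 3, 0, 1]
```

With the block layout a large array tends to live on a single node, so locality depends on where the accessing thread runs. With interleaving the same array is spread across all nodes, which evens out bandwidth at the cost of more remote accesses.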



### Simplified package architecture











Figure labels: L3/LLC (last-level cache), Package



### Cache terminology

- Cache line
  - Data transferred between memory and cache in atomic blocks
    - 64B
- Cache hit
  - A load or store is served directly from the cache
- Cache line load
  - Cache line read from main memory
- Cache line flush
  - Cache line stored to main memory
- Cache miss
  - The requested data is not in the cache
  - A cache line is selected for eviction
  - If it is modified, it is flushed first
  - The requested cache line is then loaded
- False sharing
  - Private data of different threads in the same cache line
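The terms above can be tied together with a toy model. The sketch below assumes a direct-mapped, write-back cache with 64 B lines; real caches are set-associative and far more complex, and all names here (`line_of`, `may_false_share`, `DirectMappedCache`) are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line

def line_of(addr):
    """Index of the cache line that contains a byte address."""
    return addr // LINE_SIZE

def may_false_share(addr_a, addr_b):
    """True if two addresses fall into the same cache line."""
    return line_of(addr_a) == line_of(addr_b)

class DirectMappedCache:
    """Toy write-back cache: each line maps to exactly one slot."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.slots = [None] * num_slots   # line number held by each slot
        self.dirty = [False] * num_slots  # True if the held line is modified
        self.hits = self.loads = self.flushes = 0

    def access(self, addr, write=False):
        line = line_of(addr)
        slot = line % self.num_slots
        if self.slots[slot] == line:
            self.hits += 1                # cache hit
        else:                             # cache miss
            if self.slots[slot] is not None and self.dirty[slot]:
                self.flushes += 1         # evicted line was modified: flush it
            self.slots[slot] = line       # cache line load
            self.dirty[slot] = False
            self.loads += 1
        if write:
            self.dirty[slot] = True

cache = DirectMappedCache(num_slots=4)
cache.access(0, write=True)   # miss: line 0 is loaded, then modified
cache.access(8)               # hit: same cache line
cache.access(4 * LINE_SIZE)   # miss: line 4 evicts modified line 0 (flush + load)
print(cache.hits, cache.loads, cache.flushes)  # 1 2 1
print(may_false_share(0, 63), may_false_share(63, 64))  # True False
```

Two per-thread counters at byte offsets 0 and 8 would land in the same line under this model, which is exactly the false-sharing situation; padding each counter to its own 64 B line avoids it.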



### Cache coherency

- Coherency inside the package
  - Inclusive vs. exclusive caches
- Coherency between packages
  - ccNUMA
  - MESI protocol
    - Modified, Exclusive, Shared, and Invalid
    - Snooping
- Cache line ping-pong
  - Moving cache line among caches/packages in rapid succession
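The MESI states and cache line ping-pong can be illustrated with a small simulation of a single cache line. This is a simplified sketch, not a faithful model of any real protocol implementation: the class `MesiLine` is hypothetical, and `transfers` simply counts every event in which the line has to be fetched or other copies invalidated.

```python
class MesiLine:
    """State of one cache line in each of n caches (simplified MESI)."""

    def __init__(self, n_caches):
        self.state = ["I"] * n_caches
        self.transfers = 0  # fetches and invalidations visible to other caches

    def read(self, i):
        if self.state[i] != "I":
            return                       # M, E, S: local hit, no traffic
        others = [j for j, s in enumerate(self.state) if s != "I"]
        for j in others:
            self.state[j] = "S"          # M is written back; M/E/S become S
        self.state[i] = "S" if others else "E"
        self.transfers += 1

    def write(self, i):
        if self.state[i] in ("M", "E"):
            self.state[i] = "M"          # silent upgrade: sole owner already
            return
        for j in range(len(self.state)):
            if j != i:
                self.state[j] = "I"      # other copies are invalidated via snooping
        self.state[i] = "M"
        self.transfers += 1

# Ping-pong: two caches write to the same line alternately.
shared = MesiLine(2)
for k in range(10):
    shared.write(k % 2)
print(shared.state, shared.transfers)    # ['I', 'M'] 10

# Same number of writes from a single cache: only the first one causes traffic.
private = MesiLine(2)
for _ in range(10):
    private.write(0)
print(private.state, private.transfers)  # ['M', 'I'] 1
```

In this model every alternating write forces the line to move to the other cache, which is why ping-pong (and false sharing, which triggers it unintentionally) is expensive compared with writes to a privately owned line.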



### MESI protocol





### Latencies

| Action                        | Cycles  | Time (at 3 GHz) |
|-------------------------------|---------|-----------------|
| Local L1 cache hit            | 4       | 1-2 ns          |
| Local L2 cache hit            | 14      | 4-6 ns          |
| Branch misprediction          | 16      | 5-6 ns          |
| Local L3 cache hit            | 40-75   | 12-40 ns        |
| Mutex lock/unlock             |         | 75 ns           |
| Remote L3                     | 100-300 | 30-100 ns       |
| Local memory                  |         | 100 ns          |
| Remote memory                 |         | 100-300 ns      |
| Send 1 KB over FDR InfiniBand |         | 900 ns          |
| Send 1 KB over 1 Gb Ethernet  |         | 20000 ns        |
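The time column follows from the cycle counts and the clock frequency: at 3 GHz one cycle takes 1/3 ns. A minimal sketch of the conversion, with the hypothetical helper `cycles_to_ns`, is shown below. The table gives rounded ranges, so the computed values only need to fall inside them.

```python
def cycles_to_ns(cycles, freq_ghz=3.0):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / freq_ghz

for name, cycles in [("Local L1 cache hit", 4),
                     ("Local L2 cache hit", 14),
                     ("Branch misprediction", 16)]:
    print(f"{name}: {cycles_to_ns(cycles):.1f} ns")
# Local L1 cache hit: 1.3 ns
# Local L2 cache hit: 4.7 ns
# Branch misprediction: 5.3 ns
```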



### Package schema





### Core schema

