

# UniTs - University of Trieste

Faculty of Scientific and Data Intensive Computing
Department of mathematics informatics and geosciences

# High Performance Computing

Lecturer: Prof. Stefano Cozzini

Authors:

Andrea Spinelli Christian Faccio

March 7, 2025

This document is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license. You may share and adapt this material, provided you give appropriate credit, do not use it for commercial purposes, and distribute your contributions under the same license.

# **Abstract**

As a student of Scientific and Data Intensive Computing, I've created these notes while attending the **High Performance Computing** module of **High Performance and Cloude Computing** course. The course will introduce the fundamentals of High Performance Computing, exploring both its concepts and practical applications. The notes cover a wide range of topics, including:

- An overview of High Performance Computing and its importance in solving complex, real-world problems.
- The principles behind modern computer architectures and how they influence performance.
- Essential tools and techniques for parallel programming, alongside strategies to optimize code for advanced architectures.
- The evolution of computing facilities and how to effectively leverage them for large-scale computational challenges.
- Developing a proactive mindset, moving beyond the use of pre-packaged tools to a deeper understanding of the underlying systems.

This comprehensive approach not only equips you with technical skills but also fosters a critical perspective on technological innovation in computing.

While these notes were primarily created for my personal study, they may serve as a valuable resource for fellow students and professionals interested in High Performance Computing.

# Contents

| 1 | Introduction                              |  |  |  |
|---|-------------------------------------------|--|--|--|
|   | 1 Base Concepts                           |  |  |  |
|   | 1.1.1 What is High Performance Computing? |  |  |  |
|   | 1.1.2 Performance and metrics             |  |  |  |
| 2 | ardware and Software                      |  |  |  |
|   | 1 Moore's Law                             |  |  |  |
|   | 2 Parallel Computers                      |  |  |  |

Contents

# 1.1 Base Concepts

High Performance Computing (HPC), also known as supercomputing, refers to computing systems with extremely high computational power that are able to solve hugely complex and demanding problems. [europaHighPerformance]

Often, high precision and accuracy are required in scientific and engineering simulations, which can be achieved by increasing the computational power of the system. This is where HPC comes into play, as it allows for the execution of large-scale **simulations** of complex problems in a reasonable amount of time. Simulations have become the key method for researching and developing innovative solutions in both scientific and engineering fields. They are especially prominent in leading domains such as the aerospace industry and astrophysics, where they enable the investigation and resolution of highly complex problems. However, the increasing reliance on simulation also introduces significant challenges related to complexity, scalability, and data management, which in turn impact the supporting IT infrastructure.

As scientific inquiry progresses along what is known as the Inference Spiral of System Science, the complexity of models intensifies and the influx of new data enriches these systems with additional insights. Consequently, this dynamic evolution necessitates ever increasing computational power to efficiently handle the enhanced simulations and data management challenges.



Figure 1.1: Research and Development

#### **Observation**:

In today's world, larger and larger amounts of data are constantly being generated, from 33 zettabytes globally in 2018 to an expected 181 zettabytes in 2025. This exponential growth is driving a shift towards data-intensive applications, making HPC indispensable for processing and analyzing these vast datasets efficiently. Consequently, HPC is key to unlocking valuable insights that benefit citizens, businesses, researchers, and public

#### administrations. [europaHighPerformance]

#### 1.1.1 What is High Performance Computing?

High Performance Computing (HPC) involves using powerful servers, clusters, and supercomputers, along with specialized software, tools, components, storage, and services, to solve computationally intensive scientific, engineering, or analytical tasks.

HPC is used by scientists and engineers both in research and in production across industry, government and academia.

Key elements of the HPC ecosystem include:

- Hardware: High-performance servers, clusters, and supercomputers.
- Software: Specialized tools and applications designed to optimize complex computations.
- **Applications:** Scientific, engineering, and analytical tasks that leverage high computational power.

#### People in HPC

Human capital is by far the most important aspect in the HPC landscape. Two crucial roles include HPC providers, who plan, install, and manage the resources, and HPC users, who leverage these resources to their fullest potential. The mixing and interplaying of these roles not only enhances individual competence but also drives overall advancements in high-performance computing.

#### 1.1.2 Performance and metrics

**Performance** in the realm of high-performance computing is a multifaceted concept that extends far beyond a mere measure of speed. While terms such as "how fast" something operates are often used to describe performance, they tend to be vague. Many factors contribute to the overall performance of a system, and the interpretation of these factors can vary depending on the specific context and objectives of the computational task. Performance, therefore, remains a complex and central issue in the field of HPC, as it involves more than just the raw computational speed.

The discussion often extends to the idea that the "P" in HPC might stand for more than just performance. A growing sentiment among professionals in the field suggests that high performance should be complemented by high productivity. This broader view recognizes that the true efficiency of a computing system is not only determined by its ability to perform tasks quickly but also by the ease and speed with which applications can be developed and maintained. In other words, while raw performance is critical, the overall productivity of a system—combining the system's speed with the programmer's effort—plays an equally important role.

To further clarify the distinction, consider that performance can be seen as a measure of how effectively a system executes tasks, whereas productivity is the outcome achieved relative to the effort invested in developing the application. For instance, if a code optimization leads to a system that runs twice as fast but requires an extensive period of development—say, six months of work—the benefits of the improvement must be weighed against the increased effort required. This example underlines the importance of balancing performance gains with the associated development costs

Ultimately, the challenge lies in understanding and optimizing both aspects. A successful HPC system is one that not only achieves high computational throughput but also enhances the productivity of the developers who create and refine the applications. This balance is essential for advancing the capabilities of high-performance computing in both research and production environments.

#### Number Crunching on CPU

When evaluating the performance of a high-performance computing (HPC) system, one of the most fundamental metrics is the rate at which floating point operations are executed. This rate is typically expressed in millions (Mflops) or billions (Gflops) of operations per second. In essence, it quantifies how many calculations, such as additions and multiplications, the system is capable of performing every second.

To estimate this capability, we rely on the concept of theoretical peak performance. This value is computed by considering the system's clock rate, the number of floating point operations that can be executed in a single clock cycle, and the total number of processing cores available. Under ideal conditions, the theoretical peak performance can be expressed as follows:

$$FLOPS = clock rate \times Number of FP operations \times Number of cores$$

This formula provides an upper bound on the computational power of the system. However, it is important to note that this is a best-case scenario estimate and does not always reflect the performance achievable in real applications.

#### Sustained (Peak) Performance

While the theoretical peak performance offers insight into the maximum potential of an HPC system, the actual performance observed during real-world operations is better captured by the sustained (or peak) performance. In practice, several factors such as memory bandwidth limitations, communication latencies, and input/output overhead can prevent a system from reaching its theoretical maximum.

Sustained performance refers to the effective throughput that an HPC system attains when executing actual workloads. Since it is challenging to exactly measure the number of floating point operations performed by every application, standardized benchmarks are commonly used to assess this performance. One widely recognized benchmark is the HPL Linpack test, which forms the basis for the TOP500 list of supercomputers. This benchmark emphasizes the importance of sustained performance, as it reflects the system's efficiency and reliability under realistic operational conditions.

Understanding both the theoretical and sustained performance metrics is crucial. While the former provides an idealized estimate of a system's capabilities, the latter offers a more practical perspective, thereby guiding decisions on system improvements and resource allocation in high-performance computing environments.

# Hardware and Software

In the classical Von Neumann architecture there is only one processing unit (CPU) that processes instructions, the ALU is responsible for math and logic operations and the register stores data.



Figure 2.1: Von Neumann architecture

One instruction is executed at a time, the CPU fetches the instruction from the memory, decodes it and executes it. The CPU can access the memory to read or write data. Accessing any location in the memory has always the same cost, this is called the **uniform memory access** (UMA).

#### 2.1 Moore's Law

**Definition**: *Moore's Law* 

It states that the number of transistors in a dense integrated circuit doubles about every two years.

How can we go from Moore's Law to processor performance? Through Dennard Scaling: "Power density stays constant as transistors get smaller"

Intuitively,

- Smaller transistors  $\rightarrow$  shorter propagation delay  $\rightarrow$  faster frequency
- Smaller transistors  $\rightarrow$  smaller capacitance  $\rightarrow$  lower voltage

 $Power \propto Capacitance \times Voltage^2 \times Frequency$ 

**But...** even with smaller transistors, we cannot continue reducting power, there are then two options:



Figure 2.2: Multicore processors

- Increase power
- Stop frequency scaling

From 2006, single-core performance stopped increasing, so the only way to increase performance is to use more cores. The first solution is to write efficient software to make the efficient use of hardware resources. The second is to use specialized architectural solutions, like GPUs, FPGAs, etc.

Today, **CPUs are multicore processors**, lowering clock frequency because of power and heat dissipation, but packing more computing cores onto a chip. These cores will share some resources (memory, network, disk, ...) but are still capable of independent calculations.

### 2.2 Parallel Computers

Flynn Taxonomy (1966) is used to classify parallel computers based on the number of instruction streams and data streams.



Figure 2.3: Flynn Taxonomy

|      | HW level                                                                                                            | SW level                                                                                                                  |
|------|---------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| SISD | A Von Neumann CPU                                                                                                   | no parallelism at all                                                                                                     |
| MISD | On a superscalar CPU, different ports executing different <i>read</i> on the same data                              | <ul><li>ILP on same data;</li><li>Multiple tasks or threads operating on the same data</li></ul>                          |
| SIMD | Any vector-capable hardware, the vector registers on a core, a GPU, a vector processor, an FPGA,                    | data parallelism through vector instructions and operations                                                               |
| MIMD | Every multi-core/processor system; on a superscalar CPUs, different ports executing different ops on different data | <ul> <li>ILP on different data;</li> <li>Multiple tasks or threads executing different code on different data.</li> </ul> |

Figure 2.4: Parallel computers

The essential components of a HPC cluster are:

- Compute nodes: the actual computers that perform the calculations
- **Interconnect**: the network that connects the compute nodes
- Storage: the disk space where data is stored
- **Software**: the operating system and the software stack that runs on the cluster



Figure 2.5: HPC cluster

A core is the smallest unit of computing, having one or more threads and is responsible for executing instructions.



Figure 2.6: Core



Figure 2.7: Cache hierarchy

Cache hierarchy can have different topologies.



Figure 2.8: Node topology



Cache hierarchy has been invented cause accessing memory (moving data) takes almost 100x times computing (doing operations). The cache is a small memory that stores the most frequently accessed data. The cache is faster than the main memory but smaller. The cache is divided into levels, the first level is the fastest but the smallest, the second level is slower but bigger, and so on.

In a cluster there are different types of networks:

- **High-speed network**: used for communication between nodes (parallel computation, low latency/high bandwidth, infiniband or ethernet as examples)
- I/O NETWORK: used for communication with storage (I/O requests, latency not fundamental/good bandwidth, NFS, Lustre, GPFS as examples)
- In band Management Network: used for cluster management, monitoring, and control, LRMS (Load Resource Management System) as examples
- Out of band Management Network: used for cluster management, remote control of nodes and any other device, IPMI (Intelligent Platform Management Interface) as examples

#### What about **memory**?

It is fundamental and on a supercomputer there is a hybrid approach as for the memory placement. The memory on a single node can be accessed directly by all the cores on that node (**shared-memory**). When many nodes are used at a time, a process cannot directly access the memory on a different node, it needs to issue a request for that. That is named **distributed-memory**.

#### **Shared Memory**

• Uniform Memory Access (UMA):
Each processor has uniform access to memory. Also known as symmetric multiprocessors (SMP).



Figure 2.9: Shared memory - UMA

• Non-Uniform Memory Access (NUMA): Time for memory access depends on location of data. Local access is faster on non-local access.

#### **A** Warning: Challenges for multicore

It aggravates the **Memory Wall problem**, where the memory access time is much slower than the CPU speed.

- Memory bandwidth: the memory bandwidth is limited and shared among all the cores
- Memory latency: the memory latency is high and can be a bottleneck
- **Memory contention**: the memory contention can be a problem when multiple cores access the same memory location



Figure 2.10: Parallelism within a HPC node

- 1. ILP/SIMD units
- 2. Cores
- 3.
- 4. Socket/ccNuma domains
- 5. Inner cache levels
- 6. Multiple accelerators