- Intro to course
- HPC Hardware and Software
- HPC Software Stack
- Code Optimization Fundamentals
- Modern CPU Architecture
- Practical Code Optimization
- Memory Management and Execution Model
- Cache Optimization and Data Locality
- Loop Optimization and Vectorization
- Advanced Loop Techniques and Prefetching
- Introduction to Parallel Computing
- OpenMP Fundamentals
- OpenMP Parallel Regions
- Parallel Algorithm Analysis
- OpenMP Parallel Loops
- OpenMP NUMA Awareness
- MPI Fundamentals
- MPI Point-to-Point Communication
- MPI Collective Communication
- HPC Profiling and Analysis
- HPC Debugging Techniques
- Performance Monitoring and Hardware Counters
- Master's index
- The course is taught entirely in English, using slides made by the professor on which I take notes lesson after lesson.
- The final exam consists of a project assigned by the professor and an oral exam.
- This exam is the second of the two modules that make up the entire course; the other is called "Introduction to Cloud Computing". You can find it on my profile.
The professor has his own repository for the course here
The repository with my final project is here
Core HPC fundamentals and architecture
- What is HPC and why it matters for modern computational challenges
- CPU architecture evolution: from single-core to multi-core, vectorization, and memory hierarchies
- Performance optimization: compiler techniques, loop optimization, cache-aware programming
- Memory systems: virtual memory, cache optimization, NUMA architectures
Parallel programming mastery
- Shared-memory programming with OpenMP: threads, synchronization, load balancing
- Distributed-memory programming with MPI: message passing, collective operations
- Parallel algorithm design and complexity analysis
- Performance debugging and profiling with modern tools
HPC ecosystem and infrastructure
- Hardware components: clusters, interconnects, storage systems
- Software stack: compilers, libraries, resource management (SLURM)
- Performance analysis: hardware counters, profiling tools (perf, valgrind, gdb)
- Real-world applications: scientific computing, AI/ML, weather forecasting
Professional HPC skills
- Performance measurement and benchmarking methodologies
- Debugging parallel applications and avoiding common pitfalls
- Code optimization strategies for maximum computational efficiency
- Understanding the complete HPC software and hardware stack
Attitude
- Don't be (only) a user of pre-cooked tools that you treat as black boxes
- Every question is legitimate and useful: ask about what you do not understand
- The main purpose is to learn, not to be graded
- Learning is a process, not a result
- Nobody is perfect or always right: errors and mistakes are natural
- Learning is a process that happens in your own brain, not in someone else's: wrestle with your limits before checking the solution
Introduction to High Performance Computing covering fundamental concepts, performance metrics, and real-world applications.
Key topics:
- Why HPC is essential for complex computational problems
- HPC applications: weather forecasting, protein folding (AlphaFold), AI training
- Performance metrics: FLOPS, TOP500 rankings, benchmarking
- HPC ecosystem: hardware, software, and human resources
- Current exascale systems and performance vs productivity considerations
Slides are available here
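As a back-of-the-envelope companion to the FLOPS metric above (my own summary, not from the slides), theoretical peak performance is usually estimated as:

```latex
\text{peak FLOPS} \;=\; \text{nodes} \times \frac{\text{cores}}{\text{node}}
                 \times \frac{\text{cycles}}{\text{second}}
                 \times \frac{\text{FLOPs}}{\text{cycle}}
```

where FLOPs per cycle folds in the SIMD width and the FMA units; sustained performance on benchmarks such as HPL (used for the TOP500 ranking) is typically well below this peak.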
Hardware and software foundations of HPC systems, explaining why HPC is inherently parallel and the evolution from serial to parallel computing.
Key topics:
- End of Dennard scaling and Moore's Law implications
- Von Neumann architecture limitations and Flynn's Taxonomy
- Multicore CPUs, hardware accelerators (GPUs, FPGAs, ASICs)
- HPC cluster components: nodes, interconnects, storage
- Memory architectures: UMA vs NUMA, cache hierarchies
- Memory wall problem and parallelism challenges
Slides are available here
Overview of the complete software stack required for HPC systems, from system management to user applications, with focus on resource management and scheduling.
Key topics:
- Cluster middleware: administration and resource management software
- Local Resource Management Systems (LRMS): SLURM, LSF, PBS
- Batch schedulers: job allocation, scheduling policies, resource optimization
- Scientific software management: modules system, version control
- Compiler suites: GNU, Intel, PGI - optimization vs portability trade-offs
- Library types: static vs shared libraries, dependency management
Slides are available here
Introduction to code optimization principles and compiler techniques for single-core performance improvement in HPC applications.
Key topics:
- Optimization constraints: time, memory, energy, I/O, robustness considerations
- Clean code principles: DRY (Don't Repeat Yourself), readability, maintainability
- Design methodology: algorithm selection, data structures, testing, validation
- Compiler optimization levels (-O1, -O2, -O3, -Os), architecture-specific compilation (-march=native)
- Profile-guided optimization (PGO) and automatic profiling techniques
- Memory aliasing problems and the 'restrict' qualifier in C
- Storage classes and variable qualifiers (const, volatile, register)
Slides are available here
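To make the memory-aliasing point concrete, here is a minimal sketch of my own (the function name `scale` is just illustrative): with `restrict` the compiler may assume the two pointers never overlap, so it can keep values in registers and vectorize the loop.

```c
#include <stddef.h>

/* With restrict the compiler may assume dst and src never overlap, so it
 * does not have to reload src[i] after every store to dst[i] and is free
 * to keep the loop in registers and vectorize it.                         */
void scale(double *restrict dst, const double *restrict src,
           double factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = factor * src[i];
}
```

Compiled with something like `gcc -O3 -march=native`, the loop body can then be emitted as SIMD code.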
Detailed exploration of modern single-core CPU architecture evolution, from simple Von Neumann model to complex superscalar, out-of-order, vectorized processors.
Key topics:
- Memory wall problem and cache hierarchy (L1, L2, L3) solutions
- Locality principles: temporal and spatial locality for cache efficiency
- Cache miss types: compulsory, capacity, and conflict misses, with optimization strategies
- CPU pipeline evolution: fetch, decode, execute, writeback stages
- Superscalar architecture: multiple execution ports and instruction-level parallelism (ILP)
- Vector processing: SIMD capabilities, vector registers (AVX, SSE)
- Power wall challenges: static vs dynamic power consumption, thermal design
- Cache coherency problems in multi-socket systems
Slides are available here
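A minimal sketch of my own contrasting good and bad spatial locality on the same 2-D array (`N` is an arbitrary size I picked): the row-wise sum walks memory contiguously and reuses every cache line, while the column-wise sum strides by a full row and misses on almost every access once the matrix exceeds the cache.

```c
#define N 1024
static double a[N][N];

/* Row-wise traversal: consecutive iterations touch consecutive addresses,
 * so every fetched cache line is fully reused (good spatial locality).    */
double sum_by_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-wise traversal: consecutive iterations are N*sizeof(double) bytes
 * apart, so for large N almost every access misses in cache.              */
double sum_by_cols(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}
```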
Hands-on techniques for code optimization with concrete examples, focusing on loop optimization and common performance pitfalls to avoid.
Key topics:
- Avoiding expensive function calls: sqrt(), pow(), floating-point division optimizations
- Loop hoisting and expression pre-computation for nested loops
- Variable scope optimization and register usage suggestions
- Floating-point arithmetic limitations: non-associative operations and compiler constraints
- Eliminating redundant checks and unnecessary memory references
- Local accumulator patterns for reduction operations
- Fast-math compiler flags and IEEE compliance trade-offs
Slides are available here
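A small before/after sketch of my own (illustrative names) for two of the techniques above: hoisting a loop-invariant `sqrt()` out of the loop and accumulating into a local variable instead of a memory location.

```c
#include <math.h>
#include <stddef.h>

/* Naive: sqrt(scale) is recomputed on every iteration and the accumulator
 * lives in memory, so the compiler cannot keep it in a register.          */
void weighted_sum_naive(double *out, const double *x, double scale, size_t n)
{
    *out = 0.0;
    for (size_t i = 0; i < n; i++)
        *out += x[i] * sqrt(scale);
}

/* Optimized: the loop-invariant sqrt() is hoisted out of the loop and a
 * local accumulator keeps the running sum in a register.                  */
void weighted_sum_opt(double *out, const double *x, double scale, size_t n)
{
    const double w = sqrt(scale);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * w;
    *out = acc;
}
```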
Deep dive into program execution model, memory organization, and the fundamental differences between stack and heap allocation in Unix systems.
Key topics:
- Virtual memory model: address space, paging mechanism, Translation Lookaside Buffer (TLB)
- Stack vs heap: local vs global variables, scope limitations, growth directions
- Stack frame anatomy: function calls, return addresses, local variables, BP/SP registers
- Dynamic allocation: alloca() for stack allocation, stack size limits (ulimit)
- Assembly-level memory access patterns: stack vs heap performance comparison
- Memory addressing modes and compiler optimization differences
- Variable scope and lifetime implications for performance
Slides are available here
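A minimal sketch of my own contrasting the two allocation paths discussed above: an automatic (stack) buffer that disappears with its stack frame, and a heap buffer obtained from `malloc()` that must be released explicitly.

```c
#include <stdio.h>
#include <stdlib.h>

void demo(size_t n)
{
    double on_stack[256];                 /* automatic: lives in the stack
                                             frame, size bounded by the
                                             stack limit (`ulimit -s`)     */
    double *on_heap = malloc(n * sizeof *on_heap);  /* heap: any size, must
                                                       be freed explicitly */
    if (on_heap == NULL) {
        perror("malloc");
        return;
    }

    on_stack[0] = 1.0;
    on_heap[0]  = 2.0;
    printf("%f %f\n", on_stack[0], on_heap[0]);

    free(on_heap);
}                                          /* on_stack vanishes here       */

int main(void)
{
    demo(1000000);     /* ~8 MB: fine on the heap, too big for a default stack */
    return 0;
}
```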
Advanced techniques for optimizing cache performance through memory access patterns, data organization, and locality-aware programming strategies.
Key topics:
- Cache mapping strategies: direct, associative, n-way associative with set organization
- Memory access patterns: strided access, matrix transpose optimization, cache conflicts
- Write policies: write-through vs write-back, write-allocate strategies
- Cache-associativity conflicts and padding techniques for cache resonance mitigation
- Data organization: blocking techniques, hot vs cold field separation
- Space-filling curves: Z-order (Morton), Hilbert curves for locality preservation
- Memory mountain analysis and cache size detection techniques
Slides are available here
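A hedged sketch of loop blocking applied to matrix transpose (the sizes `N` and `BS` are arbitrary values I chose, with `N` a multiple of `BS`): the tiled version keeps one tile of the source and one tile of the destination cache-resident at a time, avoiding the large-stride misses of the naive element-by-element transpose.

```c
#define N  2048
#define BS 64                 /* tile size: two BS x BS tiles fit in cache */

static double in[N][N], out[N][N];

/* Blocked (tiled) transpose: working on BS x BS tiles keeps both the rows
 * of `in` and the columns of `out` in cache, unlike the naive transpose
 * whose column writes stride through memory.                              */
void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)
                for (int j = jj; j < jj + BS; j++)
                    out[j][i] = in[i][j];
}
```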
Advanced loop optimization techniques focusing on vectorization, compiler directives, and exploiting SIMD capabilities for maximum performance.
Key topics:
- Loop vectorization fundamentals: data dependencies, vectorizable patterns
- SIMD instruction sets: SSE, AVX, AVX-512 capabilities and register usage
- Compiler auto-vectorization: enabling flags, optimization reports, vectorization hints
- Manual vectorization with intrinsics: Intel/GCC intrinsics for fine-grained control
- Loop transformations: unrolling, peeling, fusion, distribution, interchange
- Data alignment requirements and performance implications
- Vectorization obstacles: complex control flow, function calls, memory aliasing
Slides are available here
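A minimal example of manual vectorization with AVX/FMA intrinsics (my own sketch, assuming the arrays are 32-byte aligned, `n` is a multiple of 4, and the CPU supports FMA); with `-O3 -march=native` the equivalent scalar loop would usually be auto-vectorized to similar code.

```c
#include <immintrin.h>
#include <stddef.h>

/* y[i] = a*x[i] + y[i], four doubles per iteration with AVX + FMA.
 * Assumes: n multiple of 4, x and y 32-byte aligned,
 * compiled with e.g. -O3 -mavx2 -mfma.                               */
void daxpy_avx(double *restrict y, const double *restrict x,
               double a, size_t n)
{
    __m256d va = _mm256_set1_pd(a);
    for (size_t i = 0; i < n; i += 4) {
        __m256d vx = _mm256_load_pd(&x[i]);     /* aligned loads       */
        __m256d vy = _mm256_load_pd(&y[i]);
        vy = _mm256_fmadd_pd(va, vx, vy);       /* fused multiply-add  */
        _mm256_store_pd(&y[i], vy);
    }
}
```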
Sophisticated loop optimization strategies with focus on arithmetic intensity classification, multiple accumulator techniques, and hardware/software prefetching methods.
Key topics:
- Loop classification by arithmetic intensity: O(N)/O(N), O(N²)/O(N²), O(N³)/O(N²)
- Advanced loop unrolling strategies: n×m unrolling patterns, multiple accumulators
- Critical path analysis and dependency chain optimization
- Matrix multiplication optimization: loop tiling, blocking algorithms, cache-aware implementations
- Hardware and software prefetching: explicit prefetch directives, preloading techniques
- Register spill management and code bloating prevention
- Performance measurement and analysis: IPC, cycles per element, cache miss patterns
Slides are available here
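A small sketch of the multiple-accumulator idea (my own example, 2-way unrolling): the two independent add chains overlap in the pipeline, so the loop is no longer limited by the latency of a single floating-point dependency chain.

```c
#include <stddef.h>

/* Dot product with 2-way unrolling and two independent accumulators: the
 * two add chains can execute concurrently, hiding the FP-add latency that
 * limits the single-accumulator version.                                  */
double dot_unrolled2(const double *x, const double *y, size_t n)
{
    double acc0 = 0.0, acc1 = 0.0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += x[i]     * y[i];
        acc1 += x[i + 1] * y[i + 1];
    }
    if (i < n)                        /* peeled remainder iteration */
        acc0 += x[i] * y[i];
    return acc0 + acc1;
}
```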
Comprehensive introduction to parallel computing fundamentals, HPC architectures, memory models, and performance analysis in high-performance computing environments.
Key topics:
- HPC rationale: complex problem solving, simulation-based research, data processing challenges
- Parallelism classification: embarrassingly parallel vs interdependent problems
- Domain decomposition strategies: data distribution, load balancing, work imbalance management
- Shared vs distributed memory paradigms: UMA, NUMA architectures, programming models
- HPC hardware hierarchy: cores, sockets, nodes, clusters, interconnection networks
- Cache coherency protocols: MESI standard, false sharing, data consistency
- Parallel performance theory: Amdahl's law, Gustafson's law, strong vs weak scalability
- MPI and OpenMP programming paradigms introduction
Slides are available here
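For quick reference, the two scalability laws mentioned above, written for a serial fraction s and p processors:

```latex
S_{\text{Amdahl}}(p) = \frac{1}{s + (1-s)/p}
  \;\xrightarrow{\;p \to \infty\;}\; \frac{1}{s},
\qquad
S_{\text{Gustafson}}(p) = s + (1-s)\,p
```

Amdahl's law bounds strong scaling (fixed problem size), while Gustafson's law describes weak scaling, where the problem grows with the number of processors.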
Introduction to OpenMP shared-memory programming model, covering basic concepts, thread management, and the fork-join execution paradigm.
Key topics:
- Processes vs threads: memory sharing, resource allocation, creation overhead
- OpenMP vs MPI: shared-memory vs distributed-memory programming paradigms
- Fork-join execution model: master thread spawning and synchronization
- OpenMP evolution: from basic loop parallelization to tasks and accelerator offloading
- Directive-based approach advantages: abstraction, efficiency, incremental development
- Static vs dynamic extent: lexical scope and orphan directives
- Conditional compilation and portability considerations
Slides are available here
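A minimal fork-join example (my own sketch, built with something like `gcc -fopenmp`): the master thread runs the serial parts, the `parallel` directive forks a team, and the implicit barrier at the closing brace joins it again.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("serial part: master thread only\n");

    #pragma omp parallel              /* fork: a team of threads is created */
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        printf("hello from thread %d of %d\n", tid, nth);
    }                                 /* implicit barrier, then join        */

    printf("serial part again\n");
    return 0;
}
/* run with e.g. OMP_NUM_THREADS=4 ./a.out */
```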
Comprehensive coverage of OpenMP parallel regions, thread creation, memory management, and synchronization constructs for shared-memory programming.
Key topics:
- Thread creation and stack management: virtual memory layout, stack sizing
- Variable scope management: shared, private, firstprivate, lastprivate, threadprivate
- Thread specialization constructs: critical, atomic, single, master/masked directives
- Synchronization mechanisms: barriers, ordered execution, race condition prevention
- Work assignment strategies: manual thread ID-based distribution, conditional regions
- Nested parallelism: multi-level thread creation, environment variable control
- OpenMP runtime functions and internal control variables (ICVs)
Slides are available here
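A small sketch of my own combining several of the constructs above: `firstprivate` copies an initialized value into each thread, `critical` serializes the update of a shared variable, and `single` lets exactly one thread of the team execute a block.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;                    /* shared between all threads        */
    int base  = 10;                   /* firstprivate: copied per thread   */

    #pragma omp parallel firstprivate(base)
    {
        int local = base + omp_get_thread_num();   /* private per thread   */

        #pragma omp critical          /* one thread at a time updates total */
        total += local;

        #pragma omp single            /* exactly one thread runs this block */
        printf("team size: %d\n", omp_get_num_threads());
    }                                 /* implicit barrier                   */

    printf("total = %d\n", total);
    return 0;
}
```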
Analysis of parallel algorithms design and complexity, focusing on prefix sum implementations to demonstrate algorithmic efficiency differences in parallel computing.
Key topics:
- Prefix sum algorithm fundamentals: serial vs parallel approaches
- Algorithm I: recursive subdivision with n-1 steps, O(n(n+1)/2) complexity
- Algorithm II: binary tree approach with log₂(n) steps, O(n(1+log₂ n)/2) complexity
- Scalability comparison: speedup analysis and performance characteristics
- Memory access patterns and their impact on parallel algorithm efficiency
- Work complexity vs span complexity in parallel algorithm design
- Trade-offs between algorithmic approaches and practical implementation considerations
Slides are available here
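To make the comparison concrete, here is a minimal sketch of my own of the two approaches for an in-place inclusive prefix sum (serial emulation only; in a real parallel version each doubling sweep becomes a parallel loop over a second buffer):

```c
#include <stddef.h>

/* Serial inclusive prefix sum: n-1 additions, but each one depends on the
 * previous result, so the dependency chain (span) is also n-1.            */
void prefix_sum_serial(double *x, size_t n)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}

/* Tree/doubling scan: ceil(log2(n)) sweeps; within a sweep all updates are
 * independent, so each sweep can become a parallel loop (using a second
 * buffer in a genuinely parallel implementation).                         */
void prefix_sum_doubling(double *x, size_t n)
{
    for (size_t d = 1; d < n; d *= 2)
        for (size_t i = n - 1; i >= d; i--)   /* backward: safe in-place
                                                 when executed serially    */
            x[i] += x[i - d];
}
```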
Advanced OpenMP loop parallelization techniques, work scheduling strategies, and common pitfalls in shared-memory parallel programming.
Key topics:
- Parallel for construct: basic syntax, variable scoping (shared, private, firstprivate, lastprivate)
- Work scheduling policies: static, dynamic, guided scheduling with chunk size control
- Reduction operations: avoiding race conditions, false sharing problems
- Data race anatomy: memory access conflicts, synchronization requirements
- False sharing: cache line conflicts, performance degradation, mitigation strategies
- Loop clauses: collapse, nowait, ordered execution control
- Nested parallelism within loops: parallel regions vs parallel for differences
Slides are available here
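A minimal sketch of a parallel reduction loop (my own example): the `reduction(+:sum)` clause gives each thread a private partial sum and combines them after the loop, avoiding both the data race and the false sharing that naive shared updates would cause; the `schedule` clause controls how iterations are distributed.

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N];
    for (int i = 0; i < N; i++)
        x[i] = 1.0;

    double sum = 0.0;

    /* reduction(+:sum) gives every thread a private partial sum and merges
     * them after the loop: no race on `sum`, no false sharing between
     * per-thread partials stored in a shared array.                        */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);
    return 0;
}
```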
Advanced OpenMP thread affinity and memory allocation strategies for NUMA (Non-Uniform Memory Access) systems optimization.
Key topics:
- NUMA architecture challenges: memory bandwidth, latency variations, remote access costs
- Thread affinity concepts: places (threads, cores, sockets), binding policies (close, spread, master)
- Thread placement strategies: hardware thread mapping, SMT considerations
- OMP_PLACES and OMP_PROC_BIND environment variables configuration
- Memory allocation policies: touch-first vs touch-by-all strategies
- Data locality optimization: malloc vs calloc behavior, memory page mapping
- Performance impact analysis: bandwidth aggregation, cache utilization, synchronization costs
- Topology discovery tools: numactl, lstopo, hwloc utilities
Slides are available here
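A small first-touch sketch of my own (the array size is arbitrary): because pages are placed on the NUMA node of the thread that first writes them, initializing the array with the same parallel schedule as the compute loop keeps most accesses node-local. Running with e.g. `OMP_PLACES=cores OMP_PROC_BIND=spread` pins the threads so this placement stays stable.

```c
#include <stdlib.h>

#define N (1 << 26)                    /* ~512 MB of doubles */

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    if (a == NULL)
        return 1;

    /* First touch: a page lands on the NUMA node of the thread that writes
     * it first, so initialize in parallel with the same static schedule as
     * the compute loop instead of letting the master thread touch (and
     * therefore place) everything on its own node.                         */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    #pragma omp parallel for schedule(static)   /* same iteration mapping:
                                                   mostly node-local access */
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + 1.0;

    free(a);
    return 0;
}
```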
Introduction to Message Passing Interface (MPI) for distributed-memory parallel programming, covering basic concepts and communication patterns.
Key topics:
- Distributed-memory paradigm: separate address spaces, explicit message passing
- MPI as a standard: library specification, multiple implementations (OpenMPI, MPICH)
- Basic MPI structure: initialization, finalization, communicators, ranks
- Communicators and groups: MPI_COMM_WORLD, context separation, best practices
- Thread support levels: MPI_THREAD_SINGLE to MPI_THREAD_MULTIPLE
- Process identification: rank determination, size queries, communication contexts
- MPI analogy: postal service model for understanding message passing concepts
Slides are available here
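The canonical minimal MPI program (essentially the standard hello-world pattern): initialize the runtime, query rank and size on `MPI_COMM_WORLD`, and finalize.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                      /* start the MPI runtime  */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* who am I?              */
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* how many processes?    */

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                              /* shut the runtime down  */
    return 0;
}
/* build and run (OpenMPI/MPICH assumed):
 *   mpicc hello_mpi.c -o hello_mpi && mpirun -np 4 ./hello_mpi            */
```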
Comprehensive coverage of MPI point-to-point communication patterns, including blocking, non-blocking, and specialized communication modes for distributed-memory programming.
Key topics:
- Basic send/receive operations: MPI_Send, MPI_Recv syntax and semantics
- MPI data types: primitive types mapping, message envelope components
- Communication safety: deadlock prevention, unsafe code patterns, message ordering
- Communication modes: standard, synchronous (Ssend), buffered (Bsend), ready (Rsend)
- Communication protocols: eager vs rendezvous protocols, system buffering implications
- Non-blocking communications: MPI_Isend/Irecv, request handling, MPI_Test/Wait operations
- Advanced patterns: MPI_Sendrecv, probe operations, ping-pong performance testing
- Practical exercises: latency/bandwidth measurement, safe communication design
Slides are available here
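A minimal ping-pong sketch of my own (needs at least two processes): rank 0 sends first and then receives, rank 1 receives first and then sends, so the blocking calls match up and the exchange cannot deadlock; if both ranks posted a blocking receive first, the program would hang.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double msg = 3.14;
    const int tag = 0;

    if (rank == 0) {                       /* send first, then receive     */
        MPI_Send(&msg, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 got the message back: %f\n", msg);
    } else if (rank == 1) {                /* receive first, then send     */
        MPI_Recv(&msg, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&msg, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```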
Advanced MPI collective communication operations for efficient data distribution, synchronization, and computation across process groups in distributed systems.
Key topics:
- Collective operation categories: synchronization, data movement, collective computation
- Synchronization primitives: MPI_Barrier implementation and performance implications
- Data distribution patterns: broadcast, scatter/gather, allgather, alltoall operations
- Variable-size collectives: MPI_Scatterv, MPI_Gatherv for irregular data distribution
- Collective computation operations: reduce, allreduce, scan operations with predefined operators
- Algorithm complexity: tree-based vs linear algorithms, hardware-accelerated collectives
- Groups and communicators: creating custom communicators, MPI_Comm_split operations
- In-place operations: MPI_IN_PLACE usage for memory-efficient collective operations
Slides are available here
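A small sketch of my own combining two of the collectives above: `MPI_Bcast` distributes a value owned by the root to every rank, and `MPI_Reduce` sums per-rank results back onto the root.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 0;
    if (rank == 0) n = 100;                        /* root owns the value   */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* now every rank has it */

    double local = (double)(rank + 1) * n;         /* some per-rank result  */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```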
Comprehensive guide to HPC application profiling and performance analysis, covering both traditional and modern profiling tools and methodologies.
Key topics:
- Profiling fundamentals: call tree/graph generation, instrumentation vs sampling approaches
- Dynamic profiling concepts: context-sensitive analysis, calling context trees
- Classical tools comparison: gprof limitations, google perftools, valgrind capabilities
- Advanced profiling tools: perf, cachegrind, callgrind for detailed performance analysis
- Call tree visualization: gprof2dot, kcachegrind GUI tools for performance data interpretation
- Memory profiling: heap analysis, leak detection, memory access pattern optimization
- Hardware counter interfaces: PAPI, PMU access, vendor-specific profiling tools
- Profiler accuracy and limitations: avoiding measurement artifacts and interpretation pitfalls
Slides are available here
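Before reaching for the tools above, the idea behind instrumentation-based profiling can be shown with a manual timer around a region of interest (my own sketch using `clock_gettime`); real profilers automate this and add call-graph and sampling information.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Minimal manual instrumentation: wall-clock time of one code region. */
static double wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    double t0 = wtime();

    volatile double s = 0.0;              /* volatile: keep the loop alive */
    for (long i = 0; i < 100000000L; i++)
        s += 1.0 / (double)(i + 1);

    double t1 = wtime();
    printf("region took %.3f s (s = %f)\n", t1 - t0, s);
    return 0;
}
```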
Advanced debugging strategies for parallel HPC applications, covering both serial and parallel debugging approaches with modern tools and methodologies.
Key topics:
- GDB fundamentals: compilation flags, breakpoints, memory examination, stack analysis
- Multi-threaded debugging: thread management, scheduler control, parallel execution analysis
- MPI debugging strategies: process attachment, synchronization debugging, deadlock detection
- Advanced debugging workflows: core file analysis, running process inspection, reverse debugging
- Debugging parallel applications: coordinated debugging sessions, process synchronization techniques
- GDB extensions and interfaces: TUI mode, dashboard customization, GUI frontends
- Memory debugging: valgrind integration, memory error detection, leak analysis
- Practical debugging patterns: safe debugging code injection, environment-based activation
Slides are available here
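A common pattern for the "environment-based activation" bullet above (my own sketch; the variable name `WAIT_FOR_GDB` is just a convention I chose): the process blocks until you attach gdb and release it with `set var go=1`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* If WAIT_FOR_GDB is set in the environment, print the PID and spin until
 * a debugger attaches and changes `go` (in gdb: `set var go=1`).  Handy
 * for attaching to one rank of an MPI job; disabled by default.           */
static void wait_for_debugger(void)
{
    if (getenv("WAIT_FOR_GDB") == NULL)
        return;

    volatile int go = 0;
    fprintf(stderr, "PID %d waiting for gdb...\n", (int)getpid());
    while (!go)
        sleep(1);
}

int main(void)
{
    wait_for_debugger();
    printf("running normally\n");
    return 0;
}
```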
Deep dive into Performance Monitoring Units (PMUs) and hardware-based performance analysis using modern Linux tools for detailed HPC performance characterization.
Key topics:
- Performance Monitoring Units (PMUs): architecture, event selection registers, performance counters
- Hardware counter types: fixed-function counters, programmable counters, architectural vs model-specific events
- Linux perf framework: kernel infrastructure, user-space tools, event abstraction layer
- Perf usage patterns: counting mode vs sampling mode, system-wide vs process-specific monitoring
- Event selection and configuration: mnemonic names, raw event codes, event multiplexing
- Advanced perf features: call-graph recording, flame graph generation, multi-event profiling
- Alternative tools integration: PAPI library fundamentals, gperftools integration
- Performance analysis workflow: data collection strategies, bottleneck identification, optimization guidance
Slides are available here
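A minimal sketch of hardware-counter access through the PAPI low-level API (assuming PAPI is installed and the `PAPI_TOT_CYC` / `PAPI_TOT_INS` presets are available on the machine; error checking is mostly omitted for brevity):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    long long counts[2];
    int evset = PAPI_NULL;

    /* Count total cycles and retired instructions around a code region;
     * IPC = instructions / cycles.  Preset availability is CPU-dependent. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_CYC);
    PAPI_add_event(evset, PAPI_TOT_INS);

    PAPI_start(evset);

    volatile double s = 0.0;              /* dummy work to be measured */
    for (long i = 0; i < 10000000L; i++)
        s += (double)i;

    PAPI_stop(evset, counts);
    printf("cycles=%lld instr=%lld IPC=%.2f\n",
           counts[0], counts[1], (double)counts[1] / (double)counts[0]);
    return 0;
}
/* link with -lpapi, and check return codes in real code */
```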