# Performance Optimization in HPC

## Introduction

In this notebook, we will explore the fundamental techniques for optimizing code performance in High-Performance Computing (HPC) environments. Performance optimization is crucial for fully exploiting the capabilities of HPC architectures. By understanding and applying these techniques, you can significantly reduce the runtime of your computational tasks, making them more efficient and scalable.

This practice is essential in HPC as it allows for better resource utilization, reduced costs, and the ability to solve larger and more complex problems. We will cover various optimization strategies, including code profiling, memory hierarchy optimization, and the use of high-performance libraries.



## 2. Optimizing Code for HPC Architectures

### 2.1 Code Profiling and Analysis

Before optimizing any code, it's essential to understand where the bottlenecks are. Profiling tools help identify the most time-consuming parts of your code, which are the primary candidates for optimization.

### 2.2 Loop Unrolling and Vectorization

Loop unrolling and vectorization are common techniques used to enhance the performance of loops, which are often the most time-consuming parts of computational code.
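As a rough illustration of the two techniques, the sketch below sums an array with a loop whose body has been unrolled by hand four times and compares it with NumPy's vectorized reduction. The function name `sum_unrolled_by_4` and the unroll factor are illustrative choices, not part of the original notebook. In pure Python, unrolling mainly reduces loop bookkeeping, and compilers for C or Fortran often apply this transformation automatically.

```python
import numpy as np

def sum_unrolled_by_4(values):
    """Sum a 1-D sequence with the loop body manually unrolled four times."""
    n = len(values)
    total = 0.0
    limit = n - (n % 4)              # largest multiple of 4 not exceeding n
    for i in range(0, limit, 4):
        # Four additions per iteration instead of one
        total += values[i] + values[i + 1] + values[i + 2] + values[i + 3]
    for i in range(limit, n):        # handle the 0-3 leftover elements
        total += values[i]
    return total

values = np.arange(10, dtype=np.float64)   # 0.0, 1.0, ..., 9.0
print(sum_unrolled_by_4(values))           # 45.0 (unrolled Python loop)
print(float(np.sum(values)))               # 45.0 (vectorized reduction)
```

The cleanup loop matters: without it, arrays whose length is not a multiple of the unroll factor would lose their last elements.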

### 2.3 Memory Access Patterns and Cache Utilization

Efficient memory access patterns and effective use of the CPU cache can dramatically speed up your programs.


In [1]:
# Example: Profiling a simple matrix multiplication function using cProfile

import cProfile
import numpy as np

def matrix_multiply(A, B):
    return np.dot(A, B)

# Create large random matrices
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Profile the matrix multiplication
cProfile.run('matrix_multiply(A, B)')

# The output will show where the time is being spent in the function


         5 function calls in 0.107 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.105    0.105    0.105    0.105 <ipython-input-1-cda063ce1d20>:6(matrix_multiply)
        1    0.001    0.001    0.107    0.107 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 multiarray.py:741(dot)
        1    0.000    0.000    0.107    0.107 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}




### Explanation:

The above code uses Python's `cProfile` to profile a matrix multiplication function. Profiling helps identify the parts of the code that consume the most computational resources, allowing us to focus our optimization efforts effectively.
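The default `cProfile.run` report is ordered by name, which is hard to read once a program has more than a handful of functions. The sketch below shows one possible way to sort the report by cumulative time using the standard-library `pstats` module. The smaller 200 x 200 matrices and the choice to keep only the top five entries are assumptions made to keep the example quick.

```python
import cProfile
import io
import pstats

import numpy as np

def matrix_multiply(A, B):
    return np.dot(A, B)

# Smaller matrices than in the cell above, so the sketch runs quickly
A = np.random.rand(200, 200)
B = np.random.rand(200, 200)

# Collect profiling data for a single call
profiler = cProfile.Profile()
profiler.enable()
matrix_multiply(A, B)
profiler.disable()

# Sort by cumulative time and keep only the five most expensive entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by `"cumulative"` puts the functions whose call trees cost the most at the top, which is usually where optimization effort should start.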


## 3. Memory Hierarchy and Data Locality

### 3.1 Understanding Memory Hierarchy

Memory hierarchy, from registers to cache and RAM, plays a critical role in the performance of HPC applications. Optimizing for memory hierarchy can significantly reduce data access times.

### 3.2 Data Locality

Data locality refers to the use of data elements within close proximity in memory, reducing cache misses and improving overall performance.


In [2]:
# Example: Measuring the impact of data locality on performance

import time

def sum_rows(matrix):
    total = 0
    for row in matrix:
        total += sum(row)
    return total

def sum_columns(matrix):
    total = 0
    for col in range(matrix.shape[1]):
        total += sum(matrix[:, col])
    return total

# Create a large matrix
matrix = np.random.rand(10000, 10000)

# Measure row-wise sum performance
start_time = time.time()
sum_rows(matrix)
print("Row-wise sum time:", time.time() - start_time)

# Measure column-wise sum performance
start_time = time.time()
sum_columns(matrix)
print("Column-wise sum time:", time.time() - start_time)


Row-wise sum time: 11.047656297683716
Column-wise sum time: 13.172894954681396


### Explanation:

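In the above example, we measure the performance impact of traversing a matrix row by row versus column by column. NumPy stores arrays in row-major (C) order by default, so each row is contiguous in memory while consecutive elements of a column are separated by a full row's width. Row-wise access is therefore typically more cache-friendly. In the recorded run, the row-wise version took about 11.0 s and the column-wise version about 13.2 s. Much of that time is Python interpreter overhead from the built-in `sum`, so the measured gap likely understates the effect of the memory layout itself.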
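The layout behind this difference can be inspected directly. The following sketch uses a small 4 x 3 array (a size chosen only for readability) and prints the `strides` of a row and of a column, that is, how many bytes separate consecutive elements.

```python
import numpy as np

# NumPy allocates arrays in C (row-major) order by default
matrix = np.zeros((4, 3), dtype=np.float64)

print(matrix.flags["C_CONTIGUOUS"])   # True
print(matrix.strides)                 # (24, 8): 24 bytes to the next row, 8 to the next column

row = matrix[0, :]   # elements of one row sit next to each other
col = matrix[:, 0]   # elements of one column are a full row apart

print(row.strides, row.flags["C_CONTIGUOUS"])   # (8,) True
print(col.strides, col.flags["C_CONTIGUOUS"])   # (24,) False
```

A stride equal to the element size (8 bytes for `float64`) means consecutive elements share cache lines; a larger stride means each access may touch a different cache line.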


## 4. High-Performance Libraries for Scientific Computing

Leveraging high-performance libraries can save development time and ensure that your code is optimized for modern HPC architectures.

### 4.1 Using BLAS and LAPACK for Linear Algebra

BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package) are standard libraries that provide optimized implementations of basic linear algebra routines.


In [3]:
# Example: Using BLAS via SciPy for optimized matrix multiplication

from scipy.linalg import blas

# Using BLAS dgemm for matrix multiplication
C = blas.dgemm(1.0, A, B)

print("Resulting matrix shape:", C.shape)


Resulting matrix shape: (1000, 1000)


### Explanation:

Here, we use the `dgemm` function from BLAS, accessed via SciPy, to perform matrix multiplication. This function is highly optimized for performance on many HPC systems, often outperforming custom implementations.
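When calling a low-level routine directly, it is worth checking the result against a higher-level reference. The sketch below, using small seeded matrices chosen purely for illustration, compares `dgemm` with NumPy's `@` operator.

```python
import numpy as np
from scipy.linalg import blas

# Small, seeded matrices so the check is reproducible
rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))

# dgemm computes alpha * A @ B for double-precision inputs
C_blas = blas.dgemm(alpha=1.0, a=A, b=B)
C_numpy = A @ B

print(C_blas.shape)                    # (4, 5)
print(np.allclose(C_blas, C_numpy))    # True
```

NumPy's own matrix product is typically backed by a BLAS implementation as well, so the direct call is mostly useful when you need the extra arguments that `dgemm` exposes.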


## 5. Parallel I/O and Data Management

Efficient data management and parallel I/O are crucial for handling large datasets in HPC environments. This section introduces techniques to optimize I/O operations and manage data effectively.





### 5.1 Installing Required Packages

In this section, we'll install the necessary packages to perform parallel I/O operations in Google Colab using `mpi4py`.


In [5]:
# Install the necessary MPI libraries and mpi4py package
!apt-get install -y libopenmpi-dev
!pip install mpi4py

# Verifying the installation
from mpi4py import MPI

print("mpi4py is successfully installed.")


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Collecting mpi4py
  Downloading mpi4py-4.0.0.tar.gz (464 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 464.8/464.8 kB 6.7 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: mpi4py
  Building wheel for mpi4py (pyproject.toml) ... done
  Created wheel for mpi4py: filename=mpi4py-4.0.0-cp310-cp310-linux_x86_64.whl size=4266257 sha256=aad712e3110e62b3d074ef92d20c123bd73792d1fbf84e578aca778908460467
  Stored in directory: /root/.cache/pip/wheels/96/17/12/83db63ee0ae5c4b040ee87f2e5c8

### 5.2 Parallel Filesystems

Parallel filesystems like Lustre or GPFS are designed to provide high-throughput access to large datasets by allowing multiple processes to read/write data simultaneously.

In this example, we'll use `mpi4py` to demonstrate simple parallel I/O. Note that running this code in a real HPC environment would involve a more complex setup, but this example provides a basic demonstration.


In [8]:
from mpi4py import MPI
import numpy as np

# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Create a large array on each process, each filled with the rank number
data = np.full(1000000, rank, dtype='i')

# Write data to a shared file
fh = MPI.File.Open(comm, 'output.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * data.nbytes, data)
fh.Close()

# Synchronize processes
comm.Barrier()

# Reading data back collectively
collected_data = np.empty_like(data)
fh = MPI.File.Open(comm, 'output.dat', MPI.MODE_RDONLY)
fh.Read_at_all(rank * collected_data.nbytes, collected_data)
fh.Close()

# Verify the data by printing a summary from each process
print(f"Process {rank}: First element = {collected_data[0]}, Last element = {collected_data[-1]}")


Process 0: First element = 0, Last element = 0


### Explanation:

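In this example, we use `mpi4py` to perform parallel I/O: each process writes its block of data to a shared file at the byte offset `rank * data.nbytes` and then reads the same block back. The recorded output shows only `Process 0` because the cell ran as a single MPI process inside the notebook. Launching the same code as a script under `mpirun` with several processes would print one line per rank.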
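The file layout produced by the offset `rank * data.nbytes` can be illustrated without MPI. The sketch below is a serial emulation, not real parallel I/O: it loops over three hypothetical ranks, writes each block at its offset with ordinary file operations, and reads the whole file back. The helper `byte_offset`, the file name `layout_demo.dat`, and the block size are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np

def byte_offset(rank, count, dtype):
    """Byte position where the block of `rank` starts in the shared file."""
    return rank * count * np.dtype(dtype).itemsize

count = 4      # elements per block (tiny, for readability)
n_ranks = 3    # hypothetical number of MPI processes
path = os.path.join(tempfile.mkdtemp(), "layout_demo.dat")

# Write the blocks in reverse order to show that the offset, not the
# order of the writes, decides where each block ends up in the file.
with open(path, "wb") as f:
    for rank in reversed(range(n_ranks)):
        f.seek(byte_offset(rank, count, "i"))
        f.write(np.full(count, rank, dtype="i").tobytes())

layout = np.fromfile(path, dtype="i")
print(layout)   # [0 0 0 0 1 1 1 1 2 2 2 2]
```

Because every block has its own non-overlapping byte range, the processes in the MPI version can write at the same time without coordinating the order of their writes.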


## 6. Introduction to Performance Tuning and Analysis

Performance tuning and analysis are crucial for maximizing the efficiency of HPC applications. This section introduces the fundamental steps involved in performance tuning, including identifying bottlenecks, applying optimizations, and verifying improvements.

### 6.1 Overview of Performance Tuning Steps

The general workflow for performance tuning involves:
1. **Profiling the code** to identify performance bottlenecks.
2. **Applying optimizations** to the identified bottlenecks.
3. **Reprofiling the code** to assess the impact of the optimizations.
4. **Iterating** until performance goals are met.
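The workflow above can be sketched in a few lines. The helper `time_call` and the functions `baseline` and `optimized` are hypothetical names introduced for this illustration. The sketch measures a slow version, checks that the replacement gives the same answer, and measures again.

```python
import time

import numpy as np

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time of several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

def baseline(x):
    """Sum of squares with an explicit Python loop."""
    total = 0.0
    for v in x:
        total += v * v
    return total

def optimized(x):
    """Same quantity computed in compiled code."""
    return float(np.dot(x, x))

x = np.random.rand(100_000)

# Check correctness before comparing speed
assert np.isclose(baseline(x), optimized(x))

t_before = time_call(baseline, x)    # step 1: measure
t_after = time_call(optimized, x)    # step 3: measure again after optimizing
print(f"baseline:  {t_before:.6f} s")
print(f"optimized: {t_after:.6f} s")
print(f"speedup:   {t_before / t_after:.1f}x")
```

Taking the best of several runs reduces the influence of one-off disturbances such as other processes competing for the CPU.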

### 6.2 Setting Up the Environment

We'll start by setting up the necessary environment for performance analysis, including installing profiling tools and libraries needed for the exercises.


In [1]:
# Install necessary libraries
!pip install line_profiler
!apt-get install -y libopenmpi-dev
!pip install mpi4py

# Load the line_profiler extension
%load_ext line_profiler


Collecting line_profiler
  Downloading line_profiler-4.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (34 kB)
Downloading line_profiler-4.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (717 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 717.6/717.6 kB 10.8 MB/s eta 0:00:00
Installing collected packages: line_profiler
Successfully installed line_profiler-4.1.3
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.2-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


## 7. Profiling and Identifying Bottlenecks

In this section, we'll profile a computational code to identify the most time-consuming parts. Profiling is the first step in any performance tuning process.

### 7.1 Profiling with cProfile and line_profiler

We'll use `cProfile` for an overall view of the code's performance and `line_profiler` for detailed line-by-line analysis.


In [2]:
import cProfile
import numpy as np

def compute_heavy_task(A, B):
    C = np.dot(A, B)
    D = np.linalg.inv(C)
    E = np.sum(D)
    return E

# Create large random matrices
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Profile the function
cProfile.run('compute_heavy_task(A, B)')

# Detailed line profiling
%lprun -f compute_heavy_task compute_heavy_task(A, B)


         30 function calls in 0.713 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.176    0.176    0.712    0.712 <ipython-input-2-fb907a52bf7a>:4(compute_heavy_task)
        1    0.001    0.001    0.713    0.713 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2172(_sum_dispatcher)
        1    0.000    0.000    0.010    0.010 fromnumeric.py:2177(sum)
        1    0.000    0.000    0.010    0.010 fromnumeric.py:71(_wrapreduction)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:72(<dictcomp>)
        1    0.000    0.000    0.000    0.000 linalg.py:130(get_linalg_error_extobj)
        1    0.000    0.000    0.000    0.000 linalg.py:135(_makearray)
        2    0.000    0.000    0.000    0.000 linalg.py:140(isComplexType)
        1    0.000    0.000    0.000    0.000 linalg.py:153(_realType)
        1    0.000    0.000    0.000    0.000 linalg.py:159(_commonType)
 

### Explanation:

The code above uses `cProfile` to profile the entire function and `line_profiler` for a detailed line-by-line breakdown. This helps in identifying which parts of the code are the most time-consuming.
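Where `line_profiler` is not available, a coarse per-stage breakdown can be obtained by hand with `time.perf_counter`. The sketch below is a stand-in for line-level profiling, not a replacement for it; the function name `compute_heavy_task_timed` and the 200 x 200 problem size are assumptions for illustration.

```python
import time

import numpy as np

def compute_heavy_task_timed(A, B):
    """Same steps as compute_heavy_task, with a timestamp around each stage."""
    t0 = time.perf_counter()
    C = np.dot(A, B)
    t1 = time.perf_counter()
    D = np.linalg.inv(C)
    t2 = time.perf_counter()
    E = np.sum(D)
    t3 = time.perf_counter()
    timings = {"dot": t1 - t0, "inv": t2 - t1, "sum": t3 - t2}
    return E, timings

rng = np.random.default_rng(0)
A = rng.random((200, 200))
B = rng.random((200, 200))

result, timings = compute_heavy_task_timed(A, B)

# List the stages from most to least expensive
for stage, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:>4}: {seconds:.6f} s")
```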

### Exercise:

Try modifying the `compute_heavy_task` function by adding other operations, such as matrix transposition or element-wise multiplication. Re-run the profiling tools to see how the performance characteristics change.


## 8. Applying Optimizations

Once bottlenecks are identified, the next step is to apply optimizations. In this section, we will optimize matrix operations using techniques such as loop unrolling, vectorization, and memory access optimization.

### 8.1 Loop Unrolling and Vectorization

We will revisit loop unrolling and vectorization to see how they can improve performance in matrix operations.


In [3]:
import numpy as np
import time

def basic_matrix_sum(matrix):
    total = 0
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            total += matrix[i, j]
    return total

def vectorized_matrix_sum(matrix):
    return np.sum(matrix)

# Create a large matrix
matrix = np.random.rand(10000, 10000)

# Measure time for basic matrix sum
start_time = time.time()
basic_sum = basic_matrix_sum(matrix)
print("Basic matrix sum time:", time.time() - start_time)

# Measure time for vectorized matrix sum
start_time = time.time()
vectorized_sum = vectorized_matrix_sum(matrix)
print("Vectorized matrix sum time:", time.time() - start_time)


Basic matrix sum time: 29.424485683441162
Vectorized matrix sum time: 0.08589839935302734


### Explanation:

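This example compares the performance of a basic loop-based matrix sum with a vectorized version using NumPy's built-in `sum` function. Vectorization moves the loop out of the Python interpreter and into NumPy's compiled code, which removes the per-element interpreter overhead and lets the underlying implementation use SIMD instructions where available. In the recorded run, this reduced the time from about 29.4 s to about 0.086 s.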

### Exercise:

Try optimizing the `basic_matrix_sum` function by manually unrolling the loops. Measure the performance impact and compare it with the vectorized approach.


## 9. Memory Access Optimization and Cache Utilization

Memory access patterns greatly affect the performance of HPC applications. In this section, we'll explore techniques to optimize memory access and improve cache utilization.

### 9.1 Optimizing Memory Access Patterns

Efficient memory access patterns reduce cache misses, leading to faster execution times. We'll analyze the impact of row-major vs. column-major access.


In [4]:
import numpy as np
import time

def row_major_sum(matrix):
    total = 0
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            total += matrix[i, j]
    return total

def column_major_sum(matrix):
    total = 0
    for j in range(matrix.shape[1]):
        for i in range(matrix.shape[0]):
            total += matrix[i, j]
    return total

# Create a large matrix
matrix = np.random.rand(10000, 10000)

# Measure row-major access time
start_time = time.time()
row_sum = row_major_sum(matrix)
print("Row-major sum time:", time.time() - start_time)

# Measure column-major access time
start_time = time.time()
column_sum = column_major_sum(matrix)
print("Column-major sum time:", time.time() - start_time)


Row-major sum time: 28.65447998046875
Column-major sum time: 35.93936228752136


### Explanation:

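The example above compares row-major and column-major traversal of the same array. NumPy arrays use row-major (C) order by default, so the row-major loop visits adjacent memory locations while the column-major loop jumps a full row's width on every step. In the recorded run, the row-major loop took about 28.7 s and the column-major loop about 35.9 s. For an array stored in column-major (Fortran) order, the relationship is reversed.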
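If an algorithm has to walk down columns, one option is to store the data in column-major order instead. The sketch below uses `np.asfortranarray` on a small 3 x 4 array (a size picked only for readability) to show that the values stay the same while the contiguous direction changes.

```python
import numpy as np

c_matrix = np.arange(12, dtype=np.float64).reshape(3, 4)   # row-major (C order)
f_matrix = np.asfortranarray(c_matrix)                     # column-major copy

print(c_matrix.strides)   # (32, 8): rows are contiguous
print(f_matrix.strides)   # (8, 24): columns are contiguous

# A column slice is contiguous only in the column-major copy
print(c_matrix[:, 0].flags["C_CONTIGUOUS"])   # False
print(f_matrix[:, 0].flags["C_CONTIGUOUS"])   # True

# The values themselves are unchanged
print(np.array_equal(c_matrix, f_matrix))     # True
```

The conversion copies the data, so it pays off only when the column-wise traversal is repeated often enough to amortize that cost.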

### Exercise:

Modify the code to measure the cache hit rate (if possible using advanced profiling tools or libraries) for each access pattern. Observe how different matrix sizes affect cache utilization and performance.


## 10. Leveraging High-Performance Libraries

Using specialized HPC libraries can significantly enhance the performance of your applications. This section explores how to use BLAS, LAPACK, and other optimized libraries in your code.

### 10.1 Using BLAS and LAPACK for Matrix Operations

BLAS (Basic Linear Algebra Subprograms) and LAPACK are standard libraries providing highly optimized implementations of basic linear algebra routines.


In [7]:
import numpy as np
from scipy.linalg import blas, lapack

# Create small random matrices (3x3) so the printed results are easy to inspect
A = np.random.rand(3, 3)
B = np.random.rand(3, 3)

# Using BLAS dgemm for matrix multiplication
C = blas.dgemm(1.0, A, B)

# Using LAPACK for matrix inversion (getrf followed by getri)
LU, piv, info = lapack.dgetrf(A)
inv_matrix, info = lapack.dgetri(LU, piv)

# Display the results
print("Matrix A:")
print(A)

print("\nMatrix B:")
print(B)

print("\nResult of BLAS matrix multiplication (A * B = C):")
print(C)

print("\nMatrix inversion of A using LAPACK:")
print(inv_matrix)


Matrix A:
[[0.19885336 0.41443779 0.55232533]
 [0.63718034 0.84068626 0.05538211]
 [0.88803966 0.1892175  0.63410325]]

Matrix B:
[[0.05203572 0.84679635 0.35407493]
 [0.77924956 0.11606839 0.82977504]
 [0.25053182 0.83445495 0.7396368 ]]

Result of BLAS matrix multiplication (A * B = C):
[[0.47167302 0.67738203 0.82281926]
 [0.70213551 0.68335297 0.96415271]
 [0.35252048 1.30308151 0.94044664]]

Matrix inversion of A using LAPACK:
[[-1.34380128  0.40701271  1.13494807]
 [ 0.91246533  0.93698744 -0.87662388]
 [ 1.60966597 -0.8496059   0.24916082]]


### Explanation:

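This example demonstrates how to use the BLAS `dgemm` function for matrix multiplication and the LAPACK routines `dgetrf` and `dgetri` for matrix inversion: `dgetrf` computes the LU factorization of the matrix, and `dgetri` uses that factorization to compute the inverse. These libraries are optimized for performance on many HPC systems.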
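Both LAPACK routines return an `info` status code that the cell above does not inspect. The sketch below shows one way to check it and to verify the inverse numerically. The 2 x 2 matrix and the variable names `info_lu` and `info_inv` are illustrative choices.

```python
import numpy as np
from scipy.linalg import lapack

A = np.array([[4.0, 3.0],
              [6.0, 3.0]])

# Step 1: LU factorization; info == 0 means success,
# info > 0 means the matrix is exactly singular
LU, piv, info_lu = lapack.dgetrf(A)
if info_lu != 0:
    raise RuntimeError(f"dgetrf failed with info = {info_lu}")

# Step 2: inverse computed from the LU factors
inv_A, info_inv = lapack.dgetri(LU, piv)
if info_inv != 0:
    raise RuntimeError(f"dgetri failed with info = {info_inv}")

# Verify: A times its inverse should be close to the identity
print(np.allclose(A @ inv_A, np.eye(2)))   # True
```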

### Exercise:

Try using other functions from BLAS and LAPACK, such as `dsymv` for symmetric matrix-vector multiplication or `dgeev` for eigenvalue computation. Compare the performance of these library functions with your custom implementations.


## 11. Advanced Performance Tuning with Parallel I/O

Efficient I/O operations are critical for handling large datasets in HPC applications. This section covers advanced parallel I/O techniques using mpi4py.

### 11.1 Implementing Parallel I/O

We will extend our previous examples by implementing collective I/O operations, which can be more efficient for large-scale data processing.


In [9]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Create a large array on each process
data = np.full(1000000, rank, dtype='i')

# Write data collectively to a shared file
fh = MPI.File.Open(comm, 'collective_output.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)
fh.Write_at_all(rank * data.nbytes, data)
fh.Close()  # Manually close the file

# Reading data collectively
collected_data = np.empty_like(data)
fh = MPI.File.Open(comm, 'collective_output.dat', MPI.MODE_RDONLY)
fh.Read_at_all(rank * collected_data.nbytes, collected_data)
fh.Close()  # Manually close the file after reading

# Print out a summary of the data to verify the read operation
print(f"Process {rank}: First element = {collected_data[0]}, Last element = {collected_data[-1]}")


Process 0: First element = 0, Last element = 0


### Explanation:

In this example, each MPI process writes and reads a portion of data from a shared file using collective I/O operations. This technique improves the efficiency of data handling in parallel applications.

### Exercise:

Modify the code to test the performance impact of different file access modes, such as `MPI.MODE_APPEND` or non-collective I/O. Analyze how these changes affect the scalability of I/O operations when running on multiple processes.


## 12. Comprehensive Performance Analysis and Tuning

In this section, we will perform a comprehensive performance analysis and tuning of a complex HPC application. We will use profiling tools to identify bottlenecks and optimize the application.

### 12.1 Case Study: Performance Tuning of a Scientific Application

We will apply profiling, optimization, and parallel I/O techniques to a small computation that stands in for a scientific workload. The code will include matrix operations and parallel I/O.


In [11]:
import numpy as np
from scipy.linalg import blas, lapack
from mpi4py import MPI
import cProfile
import time

# MPI setup
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Problem size (kept small so the example runs quickly)
N = 500

# Create random N x N matrices
A = np.random.rand(N, N)
B = np.random.rand(N, N)

# Optimized computation
def optimized_computation(A, B):
    C = blas.dgemm(1.0, A, B)
    LU, piv, info = lapack.dgetrf(C)
    inv_matrix, info = lapack.dgetri(LU, piv)
    result = np.sum(inv_matrix)
    return result

# Profile the optimized computation
cProfile.run('optimized_computation(A, B)')

# Perform the computation
result = optimized_computation(A, B)

# Parallel I/O to save results
file_handle = MPI.File.Open(comm, 'final_result.dat', MPI.MODE_CREATE | MPI.MODE_WRONLY)
result_array = np.array([result], dtype='d')
file_handle.Write_at_all(rank * result_array.nbytes, result_array)
file_handle.Close()

# Final verification
print(f"Process {rank} completed its task and saved the result. Result sum: {result}")


         11 function calls in 0.061 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.060    0.060    0.061    0.061 <ipython-input-11-f856f0057939>:20(optimized_computation)
        1    0.000    0.000    0.061    0.061 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2172(_sum_dispatcher)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:2177(sum)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:71(_wrapreduction)
        1    0.000    0.000    0.000    0.000 fromnumeric.py:72(<dictcomp>)
        1    0.000    0.000    0.061    0.061 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
        1    0.000    0.000    0.0

### Explanation:

This case study brings together the optimization and parallelization techniques from the previous sections on a dense matrix problem, here reduced to N = 500 so that it runs quickly. The code includes profiling, the use of high-performance libraries, and parallel I/O for saving the results.
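To see how the cost of such a computation grows with the problem size, the same steps can be timed for several values of N. The sketch below is a simplified stand-in for the case study: it uses NumPy's `@` and `np.linalg.inv` rather than the direct BLAS and LAPACK wrappers, leaves out MPI, and uses small sizes chosen so that it finishes quickly. The function name `computation` is hypothetical.

```python
import time

import numpy as np

def computation(A, B):
    """Multiply, invert, and reduce, mirroring the steps of the case study."""
    C = A @ B
    return float(np.sum(np.linalg.inv(C)))

rng = np.random.default_rng(0)
timings = {}

for n in (50, 100, 200):
    A = rng.random((n, n))
    B = rng.random((n, n))
    start = time.perf_counter()
    computation(A, B)
    timings[n] = time.perf_counter() - start

for n, seconds in timings.items():
    print(f"N = {n:4d}: {seconds:.6f} s")
```

Dense matrix multiplication and inversion both require on the order of N cubed operations, so doubling N can be expected to increase the runtime by roughly a factor of eight once N is large enough for fixed overheads to become negligible.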

### Exercise:

Expand the case study by adding more complex operations, such as eigenvalue computation or solving a system of linear equations. Profile and optimize these additional steps, and analyze how the performance scales with the problem size and number of processes.


## 13. MPI Programming in C with Performance Analysis

In this section, we will write a simple MPI program in C, compile it, and run it directly within the Jupyter notebook. We will also perform basic profiling to analyze the performance.

### 13.1 Writing the MPI Program in C

First, we'll write a simple C program that initializes an array with the rank of each process, gathers all the data at the root process, and prints a summary.


In [12]:
# Write the C program to a file
c_program = """
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Create an array and initialize it with the rank of the process
    int array_size = 1000000;
    int* data = (int*)malloc(array_size * sizeof(int));
    for (int i = 0; i < array_size; i++) {
        data[i] = rank;
    }

    // Gather data at the root process
    int* collected_data = NULL;
    if (rank == 0) {
        collected_data = (int*)malloc(size * array_size * sizeof(int));
    }

    MPI_Gather(data, array_size, MPI_INT, collected_data, array_size, MPI_INT, 0, MPI_COMM_WORLD);

    // Only the root process will print the first and last elements of each process's data
    if (rank == 0) {
        for (int i = 0; i < size; i++) {
            printf("Process %d: First element = %d, Last element = %d\\n", i, collected_data[i * array_size], collected_data[(i+1) * array_size - 1]);
        }
        free(collected_data);
    }

    free(data);
    MPI_Finalize();

    return 0;
}
"""

# Save the C program to a file
with open("mpi_example.c", "w") as file:
    file.write(c_program)

print("C program written to mpi_example.c")


C program written to mpi_example.c


### 13.2 Compiling the C Program

Next, we will compile the C program using `mpicc`. This is done directly in the notebook using shell commands.


In [13]:
# Compile the C program using mpicc
!mpicc -o mpi_example mpi_example.c

# Check if the compilation was successful
!ls -l mpi_example


-rwxr-xr-x 1 root root 16368 Aug 23 14:49 mpi_example


### 13.3 Running the Program

Now, we will run the compiled program using `mpirun` or `mpiexec`. We will specify the number of processes using the `-np` flag.


In [16]:
# Use the --allow-run-as-root flag with mpirun
!mpirun --allow-run-as-root -np 4 ./mpi_example



### 13.4 Profiling the Program

To profile the program, we can use `gprof`. This section will involve compiling the program with profiling enabled, running it to generate profile data, and then analyzing that data.

### Profiling with gprof

Compile the program with the `-pg` flag, which enables profiling:


In [19]:
# Compile with profiling enabled
!mpicc -pg -o mpi_example_profiled mpi_example.c

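# Run the program (this generates the profiling data file gmon.out).
# When the notebook runs as root, Open MPI requires --allow-run-as-root, as in Section 13.3;
# the output recorded below comes from a run without this flag.
!mpirun --allow-run-as-root -np 4 ./mpi_example_profiled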

# Analyze the profiling data
!gprof ./mpi_example_profiled gmon.out > analysis.txt

# Display the profiling results
!cat analysis.txt


--------------------------------------------------------------------------
mpirun has detected an attempt to run as root.

Running as root is *strongly* discouraged as any mistake (e.g., in
defining TMPDIR) or bug can result in catastrophic damage to the OS
file system, leaving your system in an unusable state.

We strongly suggest that you run mpirun as a non-root user.

You can override this protection by adding the --allow-run-as-root option
to the cmd line or by setting two environment variables in the following way:
the variable OMPI_ALLOW_RUN_AS_ROOT=1 to indicate the desire to override this
protection, and OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 to confirm the choice and
add one more layer of certainty that you want to do so.
We reiterate our advice against doing so - please proceed at your own risk.
--------------------------------------------------------------------------
gmon.out: No such file or directory


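In this section, we wrote a simple MPI program in C, compiled it, and ran it directly within the Jupyter notebook. We also attempted basic profiling with `gprof`. In the output recorded above, `mpirun` refused to start because it was invoked as root without `--allow-run-as-root`, so the program never ran, no `gmon.out` file was written, and `gprof` had nothing to analyze. The profiled binary must be run with that flag (as in Section 13.3) before `gprof` can report anything.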
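One way to avoid the root-user refusal shown in the output above is to build the launch command programmatically and add the flag only when it is needed. The helper `build_mpirun_command` below is a hypothetical sketch; it only assembles the command line and does not start any processes. The `--allow-run-as-root` flag is specific to Open MPI.

```python
import os
import shlex

def build_mpirun_command(executable, n_procs, running_as_root=None):
    """Assemble an mpirun command, adding --allow-run-as-root only for root."""
    if running_as_root is None:
        # os.geteuid exists only on Unix-like systems
        running_as_root = hasattr(os, "geteuid") and os.geteuid() == 0
    command = ["mpirun"]
    if running_as_root:
        command.append("--allow-run-as-root")   # Open MPI specific flag
    command += ["-np", str(n_procs), executable]
    return command

command = build_mpirun_command("./mpi_example_profiled", 4, running_as_root=True)
print(shlex.join(command))
# mpirun --allow-run-as-root -np 4 ./mpi_example_profiled
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where Open MPI is installed.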

### Exercises:

- **Modify the C Program**: Try increasing the size of the array or changing the type of data being processed, and observe how these changes impact performance as reported by `gprof`.
- **Explore Advanced Profiling Tools**: For more detailed analysis, consider using tools like `perf` or `Intel VTune` to gain deeper insights into your program's performance.
