# ML4HPC: CNN

### Team Members:
- Luca Venerando Greco
- Bice Marzagora
- Elia Vaglietti


### Importing Libraries

### Library Descriptions

1. **NumPy (`np`)**:
   - A fundamental package for scientific computing with Python. It provides support for arrays, matrices, and many mathematical functions to operate on these data structures.

2. **Matplotlib (`plt`)**:
   - A plotting library for creating static, animated, and interactive visualizations in Python. It is widely used for generating plots, histograms, bar charts, and other types of graphs.

3. **Keras (`mnist`)**:
   - A high-level neural networks API written in Python. Earlier releases ran on top of TensorFlow, CNTK, or Theano; Keras 3 supports TensorFlow, JAX, and PyTorch backends. The `mnist` module provides access to the MNIST dataset, a large database of handwritten digits commonly used for training image processing systems.

4. **JAX (`jax`, `jnp`, `grad`)**:
   - A library for high-performance machine learning research. It provides a NumPy-like API (`jax.numpy` or `jnp`) with automatic differentiation (`grad`), GPU/TPU acceleration, and just-in-time compilation to optimize performance.

5. **Time (`time`)**:
   - A standard Python library for time-related functions, such as getting the current time, pausing execution, and measuring the execution time of code.

6. **OS (`os`)**:
   - A standard Python library for interacting with the operating system. It provides functions to interact with the file system, manage directories, and handle environment variables.

7. **TQDM (`tqdm`)**:
   - A library for creating progress bars in Python. It is useful for tracking the progress of loops and long-running operations, providing a visual indication of progress.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
import jax
import jax.numpy as jnp
from jax import grad
import time
import os
from tqdm import tqdm

2024-12-02 23:27:38.515255: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-12-02 23:27:38.519489: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-12-02 23:27:38.530299: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1733178458.547251 2830939 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733178458.552997 2830939 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-02 23:27:38.572671: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

### Directory Setup and Job Configuration

We now set up the necessary directories and define the job configurations. Specifically, we create folders for storing data and logs, if they do not already exist.

If no new data is needed, set the `GENERATE_DATA` variable to `False` to skip the data generation step.


In [2]:
current_dir = os.getcwd()

data_folder = "data"
logs_folder = "logs"

# Create the output folders; exist_ok=True makes this a no-op if they already exist
os.makedirs(data_folder, exist_ok=True)
os.makedirs(logs_folder, exist_ok=True)

n_runs = 30

GENERATE_DATA = False

### Job Submission Function

We define a function `submit_job` that handles the submission of jobs to the scheduler. It takes the path to a launch-script template, the number of nodes, a job name, and the number of epochs. It creates the job's data and log directories, reads the template, fills in its placeholders with the provided parameters, writes the resulting script to the job's log directory, and submits it with the `sbatch` command.

Because the project launches several kinds of jobs, not every launch script uses every placeholder. We still pass the full set of values so that the call is the same for all job types.

In [3]:
def submit_job(launch_file, num_nodes, job_name, num_epochs):
    num_tasks_per_node = 128

    if num_tasks_per_node > 128:
        # Raise instead of calling exit(), which would stop the notebook kernel
        raise ValueError("The number of tasks per node should be less than or equal to 128")

    # Per-job output directories (no error if they already exist)
    os.makedirs(f"{data_folder}/{job_name}", exist_ok=True)
    os.makedirs(f"{logs_folder}/{job_name}", exist_ok=True)

    with open(launch_file, 'r') as file:
        launch_script = file.read()

    launch_script = launch_script.format(
        num_nodes=num_nodes,
        num_tasks_per_node=num_tasks_per_node,
        current_dir=current_dir,
        world_size=num_nodes*num_tasks_per_node,
        num_epochs=num_epochs,
        data_folder=f"{data_folder}/{job_name}",
        logs_folder=f"{logs_folder}/{job_name}"
    )

    script_filename = f"{logs_folder}/{job_name}/{launch_file.split('/')[-1]}"
    with open(script_filename, "w") as script_file:
        script_file.write(launch_script)

    os.system(f"sbatch {script_filename}")
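The template mechanism used by `submit_job` can be illustrated with a small sketch. The launch files under `launchers/` are not shown in this notebook, so the template below is a hypothetical stand-in, and `render_launch_script` and `train.py` are names invented for the example. It relies on two properties of `str.format`: keyword arguments that the template does not reference are ignored, and literal braces (for example in shell variables) must be doubled.

```python
# Hypothetical launch template; the real files under launchers/ are not shown
# in this notebook, so every line of it is an assumption for illustration.
TEMPLATE = """#!/bin/bash -l
#SBATCH --nodes={num_nodes}
#SBATCH --ntasks-per-node={num_tasks_per_node}
#SBATCH --output={logs_folder}/%j.out

cd {current_dir}
echo "Job ${{SLURM_JOB_ID}} running with world size {world_size}"
srun python train.py --out {data_folder}
"""


def render_launch_script(template, num_nodes, num_tasks_per_node, **extra):
    """Fill the placeholders with str.format, as submit_job does."""
    return template.format(
        num_nodes=num_nodes,
        num_tasks_per_node=num_tasks_per_node,
        world_size=num_nodes * num_tasks_per_node,
        **extra,
    )


# num_epochs is passed but not referenced by this template: str.format ignores it.
script = render_launch_script(
    TEMPLATE, 1, 128,
    current_dir="/tmp/project", num_epochs=10,
    data_folder="data/demo", logs_folder="logs/demo",
)
print(script)
```

In the rendered script, `${{SLURM_JOB_ID}}` becomes `${SLURM_JOB_ID}`, so shell variables survive the formatting step as long as their braces are doubled in the template.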

### Defining Test Functions

In the following cells, we define functions that run the different tests. They automate the submission of jobs for the MPI, GPU, and baseline tests. Each function generates a unique job name for every run, submits the job using the `submit_job` function, and returns the job names for tracking purposes.

In [4]:
def mpi_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/ten_nodes_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/ten_nodes_test/run_{i}"

        submit_job("launchers/launch_cpu_batch.sh", 1, job_name, 10)

        job_names.append(job_name)

    return job_names

In [5]:
def gpu_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/gpu_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/gpu_test/run_{i}"

        submit_job("launchers/launch_gpu.sh", 1, job_name, 10)

        job_names.append(job_name)

    return job_names

In [6]:
def baseline_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/baseline_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/baseline_test/run_{i}"

        submit_job("launchers/launch_baseline.sh", 1, job_name, 10)

        job_names.append(job_name)

    return job_names
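The three functions above differ only in the test name and the launcher file. As a minimal sketch, they could be expressed through one generic helper. The name `submit_runs` and its signature are hypothetical; the submit function is passed in as a parameter so that the helper can be exercised without a scheduler.

```python
import os
import tempfile


def submit_runs(test_name, launch_file, n_runs, data_folder, submit):
    """Create one output directory per run and submit each run via `submit`."""
    job_names = []
    for i in range(n_runs):
        # No leading slash: os.path.join would treat it as an absolute path
        job_name = f"{test_name}/run_{i}"
        os.makedirs(os.path.join(data_folder, job_name), exist_ok=True)
        submit(launch_file, 1, job_name, 10)  # 1 node, 10 epochs, as in the cells above
        job_names.append(job_name)
    return job_names


# Dry run with a stand-in for submit_job that only records its arguments.
calls = []
with tempfile.TemporaryDirectory() as tmp:
    names = submit_runs("gpu_test", "launchers/launch_gpu.sh", 3, tmp,
                        lambda *args: calls.append(args))
print(names)
```

In the notebook, the stand-in would be replaced by `submit_job`, and `n_runs` and `data_folder` by the values defined earlier.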

### Waiting for jobs

We now wait for all the jobs to complete. The loop below polls every 10 seconds for each job's `timings.txt` file, and the `tqdm` progress bar advances as jobs finish.

In [7]:
import socket

hostname = socket.gethostname()
print(f"Hostname: {hostname}")

Hostname: access1.aion-cluster.uni.lux


In [8]:
all_jobs_to_wait = []

if GENERATE_DATA:
    # if hostname contains "aion":
    if "aion" in hostname:
        all_jobs_to_wait.extend(baseline_test())
        all_jobs_to_wait.extend(mpi_test())
    elif "iris" in hostname:
        all_jobs_to_wait.extend(gpu_test())

    print("Waiting for jobs to finish...")
    print(all_jobs_to_wait)

In [9]:
for job_name in tqdm(all_jobs_to_wait):
    while not os.path.exists(f"{data_folder}/{job_name}/timings.txt"):
        time.sleep(10)  # Poll every 10 seconds

0it [00:00, ?it/s]
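A polling loop of this kind waits indefinitely if a job fails before writing its `timings.txt`. The following is a minimal sketch of a variant with an optional timeout; the function name `wait_for_files` and its parameters are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import time


def wait_for_files(paths, poll_seconds=10.0, timeout_seconds=None):
    """Block until every path exists; return the paths still missing on timeout."""
    start = time.monotonic()
    pending = list(paths)
    while pending:
        # Keep only the files that have not appeared yet
        pending = [p for p in pending if not os.path.exists(p)]
        if not pending:
            break
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            break
        time.sleep(poll_seconds)
    return pending


# Small demonstration: one file exists, the other is never written.
with tempfile.TemporaryDirectory() as tmp:
    done = os.path.join(tmp, "timings.txt")
    open(done, "w").close()
    missing = os.path.join(tmp, "never_written.txt")
    print(wait_for_files([done, missing], poll_seconds=0.01, timeout_seconds=0.05))
```

An empty return value means that all files appeared; a non-empty one lists the jobs that may need to be inspected or resubmitted.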


### Timing Analysis

In this section, we analyze the execution times of the baseline, MPI, and GPU versions, each of which was run 30 times. We read the timing data from the generated files, calculate the mean and standard deviation of the execution times, and create a dataframe to summarize the results.

The dataframe includes the following columns:
- **Run**: The run identifier.
- **Timing**: The total execution time for each run.
- **CPU Time**: The sum of CPU times across all ranks for each run.

We then print the dataframe and the calculated mean and standard deviation of the execution times.

In [10]:
import pandas as pd

def get_mean_and_std_of_times(job_name):
    timings = []
    cpu_times = []

    for i in range(n_runs):
        with open(f"{data_folder}/{job_name}/run_{i}/timings.txt", "r") as file:
            lines = file.readlines()
            # Split on the first colon; str.lstrip would strip a character set, not a prefix
            timings.append(float(lines[0].split(":", 1)[1]))
            cpu_times.append(float(lines[1].split(":", 1)[1]))
        
    df = pd.DataFrame({
        'Run': [f'run_{i}' for i in range(n_runs)],
        'Timing': timings,
        'CPU Time': cpu_times
    })

    mean_timing = df['Timing'].mean()
    std_timing = df['Timing'].std()

    return df, mean_timing, std_timing
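The parsing step can be sketched in isolation. The layout of `timings.txt` assumed here (a `Real time:` line followed by a `CPU time:` line) is inferred from the function above, and `parse_timings` is a hypothetical helper name. The sample values are those of `run_0` in the GPU test.

```python
def parse_timings(text):
    """Parse 'Label: value' lines into a dict of floats keyed by label."""
    values = {}
    for line in text.splitlines():
        # partition splits on the first colon only
        label, sep, value = line.partition(":")
        if sep and value.strip():
            values[label.strip()] = float(value)
    return values


sample = "Real time: 7943.5663\nCPU time: 9385.8839\n"
print(parse_timings(sample))
```

Looking values up by label rather than by line position keeps the parser working if the order of the lines in the file changes.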

In [12]:
# Get the dataframe for GPU test
df_gpu, mean_timing_gpu, std_timing_gpu = get_mean_and_std_of_times("gpu_test")
print("GPU Test DataFrame:")
print(df_gpu)
print(f"Mean Timing: {mean_timing_gpu}, Std Timing: {std_timing_gpu}")

# Get the dataframe for MPI (CPU) test
df_cpu, mean_timing_cpu, std_timing_cpu = get_mean_and_std_of_times("ten_nodes_test")
print("MPI Test DataFrame:")
print(df_cpu)
print(f"Mean Timing: {mean_timing_cpu}, Std Timing: {std_timing_cpu}")

# Get the dataframe for Baseline test
df_baseline, mean_timing_baseline, std_timing_baseline = get_mean_and_std_of_times("baseline_test")
print("Baseline Test DataFrame:")
print(df_baseline)
print(f"Mean Timing: {mean_timing_baseline}, Std Timing: {std_timing_baseline}")

# Mean speedup of the GPU version over the baseline
speedup = mean_timing_baseline / mean_timing_gpu
print(f"Speedup GPU: {speedup}")

# Mean speedup of the MPI (CPU) version over the baseline
speedup = mean_timing_baseline / mean_timing_cpu
print(f"Speedup CPU: {speedup}")


GPU Test DataFrame:
       Run     Timing    CPU Time
0    run_0  7943.5663   9385.8839
1    run_1  8657.1960   9906.1757
2    run_2  6849.7040   8055.2187
3    run_3  6726.0427   7879.7719
4    run_4  7968.1344   9146.6158
5    run_5  8686.6887   9970.8935
6    run_6  6764.4210   7966.2162
7    run_7  6695.2882   7853.4974
8    run_8  7937.4752   9067.6058
9    run_9  8851.9454  10114.9056
10  run_10  6878.6781   8093.9334
11  run_11  6734.8552   7911.2338
12  run_12  8223.0523   9514.6220
13  run_13  9060.9863  10384.1862
14  run_14  6891.7856   8101.5388
15  run_15  6729.4085   7898.5693
16  run_16  7845.9373   9304.7396
17  run_17  8792.5570  10070.7770
18  run_18  6828.6254   8039.3488
19  run_19  6702.5599   7865.9788
20  run_20  7960.5973   9073.0003
21  run_21  8526.8507   9813.1914
22  run_22  6816.2103   8023.7662
23  run_23  6722.7135   7891.0677
24  run_24  8140.6264   9321.3822
25  run_25  8773.0870  10080.3057
26  run_26  6835.0856   8048.5783
27  run_27  6707.6850   7880

## Speedup analysis

### MPI

The MPI implementation achieves a speedup of 1.36 over the baseline CPU implementation, that is, the baseline's mean execution time divided by the MPI version's mean execution time.

Parallelizing with MPI therefore reduces the execution time, but only modestly. The limited gain may be due to communication overhead between MPI processes and to the relatively small problem size, which may not give each process enough work to offset that overhead.
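To put the MPI figure in context, speedup is commonly paired with parallel efficiency, defined as the speedup divided by the number of workers. The sketch below is illustrative only. The function names are hypothetical, and the worker count of 128 assumes that the MPI run used 1 node with 128 tasks, as configured in `submit_job`, and that the baseline ran as a single process; neither assumption is confirmed by the launch scripts, which are not shown.

```python
def speedup(baseline_time, parallel_time):
    """Ratio of the baseline execution time to the parallel execution time."""
    return baseline_time / parallel_time


def parallel_efficiency(speedup_value, num_workers):
    """Speedup per worker; 1.0 would correspond to ideal linear scaling."""
    return speedup_value / num_workers


mpi_speedup = 1.36   # value reported in this section
num_workers = 128    # assumption: 1 node x 128 tasks, single-process baseline
print(f"Parallel efficiency: {parallel_efficiency(mpi_speedup, num_workers):.4f}")
```

Under these assumptions the efficiency is roughly 1%, which would be consistent with the overhead explanation given above.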

### GPU

The GPU implementation achieves a speedup of 7.81 over the baseline CPU implementation, which shows that GPU acceleration is effective for this workload.

GPUs are well suited to data-parallel workloads: they execute many operations simultaneously, which here yields a substantial reduction in execution time compared with the CPU implementation. This highlights the potential benefit of GPU resources for computationally intensive tasks in high-performance computing applications.

### Conclusion
The results of this project demonstrate the potential benefits of utilizing parallel computing techniques, such as MPI and GPU acceleration, for high-performance computing applications. While the MPI implementation provided a moderate improvement, the GPU implementation significantly outperformed the baseline CPU implementation, showcasing the advantages of GPU acceleration for computationally intensive tasks.

Future work could involve optimizing the MPI implementation to reduce communication overhead and exploring hybrid approaches that combine MPI and GPU acceleration to further enhance performance. Additionally, scaling the problem size and testing on larger datasets could provide more insights into the scalability and efficiency of the parallel implementations.