# ML4HPC: Ensemble of Forecasters

### Team Members:
- Luca Venerando Greco
- Bice Marzagora
- Elia Vaglietti


### Importing Libraries

In this notebook, we start by importing the libraries used throughout the workflow:

- `numpy`: A fundamental package for scientific computing with Python, used for working with arrays and matrices.
- `matplotlib.pyplot`: A plotting library used for creating static, animated, and interactive visualizations in Python.
- `keras.datasets.mnist`: A loader for the MNIST dataset.
- `jax`, `jax.numpy`, and `jax.grad`: Array computing and automatic differentiation with JAX.
- `time`: Provides various time-related functions.
- `os`: Provides functions for interacting with the operating system, such as creating directories and handling file paths.
- `tqdm`: A library for creating progress bars and progress meters.

These libraries cover data manipulation, visualization, and managing the execution of jobs in a high-performance computing environment.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
import jax
import jax.numpy as jnp
from jax import grad
import time
import os
import tqdm

2024-12-02 16:35:44.802400: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-12-02 16:35:44.808143: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-12-02 16:35:44.822348: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1733153744.842786 2227458 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1733153744.849396 2227458 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-02 16:35:44.876063: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU ins

### Directory Setup and Job Configuration

We now set up the necessary directories and define the job configurations. Specifically, we create folders for storing charts, data, and logs if they do not already exist. We also define the node counts for the scaling tests and the number of repeated runs (`n_runs`) for each test.

If no new data is needed, set the `GENERATE_DATA` variable to `False` to skip the data generation step.


In [2]:
current_dir = os.getcwd()

charts_folder = "charts"
data_folder = "data"
logs_folder = "logs"

# Create the output folders; exist_ok avoids an error if they already exist
for folder in (data_folder, logs_folder, charts_folder):
    os.makedirs(folder, exist_ok=True)

ten_nodes = 10
strong_scaling_nodes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # Number of nodes to test
weak_scaling_nodes   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # Number of nodes to test

n_runs = 30

GENERATE_DATA = True

### Job Submission Function

We define a function `submit_job` that handles the submission of jobs to the scheduler. This function takes the path to a launch-script template, the number of nodes, a job name, and the number of epochs as input parameters. It creates the necessary directories for storing data and logs, reads the template launch script, formats it with the provided parameters, writes the formatted script to a file, and submits the job using the `sbatch` command.

In [3]:
def submit_job(launch_file, num_nodes, job_name, num_epochs):
    num_tasks_per_node = 128

    # Guard: at most 128 tasks per node. Raising keeps the notebook kernel alive,
    # whereas exit(1) would terminate it.
    if num_tasks_per_node > 128:
        raise ValueError("The number of tasks per node must be less than or equal to 128")

    # Create the per-job data and log folders if they do not exist yet
    os.makedirs(f"{data_folder}/{job_name}", exist_ok=True)
    os.makedirs(f"{logs_folder}/{job_name}", exist_ok=True)

    with open(launch_file, 'r') as file:
        launch_script = file.read()

    launch_script = launch_script.format(
        num_nodes=num_nodes,
        num_tasks_per_node=num_tasks_per_node,
        current_dir=current_dir,
        world_size=num_nodes*num_tasks_per_node,
        num_epochs=num_epochs,
        data_folder=f"{data_folder}/{job_name}",
        logs_folder=f"{logs_folder}/{job_name}"
    )

    script_filename = f"{logs_folder}/{job_name}/{launch_file.split('/')[-1]}"
    with open(script_filename, "w") as script_file:
        script_file.write(launch_script)

    os.system(f"sbatch {script_filename}")
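
The launch scripts under `launchers/` are not shown in this notebook. As a rough illustration of the templating step, the sketch below fills a hypothetical SLURM template with `str.format`, using the same placeholder names that `submit_job` passes. The template text, the `train.py` entry point, its command-line flags, and the `render_launch_script` helper are all assumptions made for the example, not the project's actual launcher.

```python
import os

# Hypothetical launch-script template; the real files in launchers/ may differ.
# Literal shell braces are doubled ({{ }}) so that str.format leaves them intact.
TEMPLATE = """#!/bin/bash
#SBATCH --nodes={num_nodes}
#SBATCH --ntasks-per-node={num_tasks_per_node}
#SBATCH --output={logs_folder}/slurm-%j.out

cd {current_dir}
echo "Running on ${{SLURM_JOB_NUM_NODES}} node(s)"
srun python train.py --world-size {world_size} --epochs {num_epochs} --out {data_folder}
"""


def render_launch_script(num_nodes, num_tasks_per_node, num_epochs, job_name):
    """Fill the template with the same keys submit_job uses, without submitting."""
    return TEMPLATE.format(
        num_nodes=num_nodes,
        num_tasks_per_node=num_tasks_per_node,
        current_dir=os.getcwd(),
        world_size=num_nodes * num_tasks_per_node,
        num_epochs=num_epochs,
        data_folder=f"data/{job_name}",
        logs_folder=f"logs/{job_name}",
    )


script = render_launch_script(2, 128, 10, "demo/run_0")
print(script)
```

One detail worth keeping in mind with this approach: any shell variable written as `${VAR}` in a template must be escaped as `${{VAR}}`, otherwise `str.format` raises a `KeyError`.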

### Defining Test Functions

In the following sections, we define functions to run the different tests. These functions automate the submission of jobs for the ten-node, GPU, baseline, strong scaling, and weak scaling tests. Each function generates a unique job name per run, submits the job using the `submit_job` function, and returns the job names for tracking purposes.

In [None]:
def ten_nodes_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/ten_nodes_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/ten_nodes_test/run_{i}"

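        # NOTE: outdated call. submit_job now expects
        # (launch_file, num_nodes, job_name, num_epochs), so this raises a TypeError as written.
        submit_job(ten_nodes, job_name)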

        job_names.append(job_name)

    return job_names

In [4]:
def gpu_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/gpu_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/gpu_test/run_{i}"

        submit_job("launchers/launch_gpu.sh", 1, job_name, 10)

        job_names.append(job_name)

    return job_names

In [None]:
def strong_scaling():
    job_names = []
    for run in range(1):
        for num_nodes in strong_scaling_nodes:
            job_name = f"/strong_scaling/run_{run}/nodes_{num_nodes}"
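            # NOTE: outdated call. submit_job now expects
            # (launch_file, num_nodes, job_name, num_epochs), so this raises a TypeError as written.
            submit_job(num_nodes, job_name)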
            job_names.append(job_name)
    return job_names

In [None]:
# def weak_scaling():
#     job_names = []
#     for run in range(1):
#         for num_nodes in weak_scaling_nodes:
#             job_name = f"/weak_scaling/run_{run}/nodes_{num_nodes}_forecasters_{weak_scaling_forecasters*num_nodes}"
#             submit_job(num_nodes, job_name)
#             job_names.append(job_name)
#     return job_names

In [5]:
def baseline_test():
    job_names = []
    for i in range(n_runs):
        run_dir = f"{data_folder}/baseline_test/run_{i}"
        if not os.path.exists(run_dir):
            os.makedirs(run_dir)
        
        job_name = f"/baseline_test/run_{i}"

        submit_job("launchers/launch_baseline.sh", 1, job_name, 10)

        job_names.append(job_name)

    return job_names

### Waiting for jobs

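We first check the hostname to decide which tests to submit: the baseline test on hosts whose name contains `aion`, and the GPU test on hosts whose name contains `iris`. We then wait for all the jobs to complete by polling for each job's `timings.txt` file, while a `tqdm` progress bar tracks how many jobs have finished.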

In [6]:
import socket

hostname = socket.gethostname()
print(f"Hostname: {hostname}")

Hostname: access1.iris-cluster.uni.lux


In [7]:
all_jobs_to_wait = []

if GENERATE_DATA:
    # if hostname contains "aion":
    if "aion" in hostname:
        all_jobs_to_wait.extend(baseline_test())
        # all_jobs_to_wait.extend(ten_nodes_test())
        # all_jobs_to_wait.extend(strong_scaling())
        # all_jobs_to_wait.extend(weak_scaling())
    elif "iris" in hostname:
        all_jobs_to_wait.extend(gpu_test())

    print("Waiting for jobs to finish...")
    print(all_jobs_to_wait)

Submitted batch job 3692889
Submitted batch job 3692890
Submitted batch job 3692891
Submitted batch job 3692892
Submitted batch job 3692893
Submitted batch job 3692894
Submitted batch job 3692895
Submitted batch job 3692896
Submitted batch job 3692897
Submitted batch job 3692898
Submitted batch job 3692899
Submitted batch job 3692900
Submitted batch job 3692901
Submitted batch job 3692902
Submitted batch job 3692903
Submitted batch job 3692904
Submitted batch job 3692905
Submitted batch job 3692906
Submitted batch job 3692907
Submitted batch job 3692908
Submitted batch job 3692909
Submitted batch job 3692910
Submitted batch job 3692911
Submitted batch job 3692912
Submitted batch job 3692913
Submitted batch job 3692914
Submitted batch job 3692915
Submitted batch job 3692916
Submitted batch job 3692917
Submitted batch job 3692918
Waiting for jobs to finish...
['/gpu_test/run_0', '/gpu_test/run_1', '/gpu_test/run_2', '/gpu_test/run_3', '/gpu_test/run_4', '/gpu_test/run_5', '/gpu_test/run_

In [None]:
# `tqdm` was imported as a module, so the progress-bar class is `tqdm.tqdm`
for job_name in tqdm.tqdm(all_jobs_to_wait):
    while not os.path.exists(f"{data_folder}/{job_name}/timings.txt"):
        time.sleep(10)  # Poll every 10 seconds
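
The loop above blocks indefinitely if a job fails before writing `timings.txt`. A minimal sketch of the same polling idea with an optional timeout follows; `wait_for_file` and its parameters are hypothetical names introduced for this example.

```python
import os
import tempfile
import time


def wait_for_file(path, poll_seconds=10.0, timeout_seconds=None):
    """Return True once `path` exists, or False if the timeout expires first."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not os.path.exists(path):
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


# Usage: an existing file is detected immediately, a missing one times out.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "timings.txt")
    open(present, "w").close()
    found = wait_for_file(present, poll_seconds=0.01, timeout_seconds=1.0)
    missing = wait_for_file(
        os.path.join(tmp, "absent.txt"), poll_seconds=0.01, timeout_seconds=0.05
    )

print(found, missing)
```

Returning a boolean rather than raising lets the caller decide whether a timed-out job should be skipped or reported.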

### Timing Analysis

In this section, we analyze the execution times for the GPU and baseline tests. We read the timing data from the generated files, calculate the mean and standard deviation of the execution times, and create a dataframe to summarize the results.

The dataframe includes the following columns:
- **Run**: The run identifier.
- **Timing**: The real time recorded for each run (the `Real time:` line of `timings.txt`).
- **CPU Time**: The CPU time recorded for each run (the `CPU time:` line of `timings.txt`).

We then print both dataframes, the mean and standard deviation of the execution times, and the mean speedup of the GPU test over the baseline.

In [9]:
import pandas as pd

def get_mean_and_std_of_times(job_name):
    timings = []
    cpu_times = []

    for i in range(n_runs):
        with open(f"{data_folder}/{job_name}/run_{i}/timings.txt", "r") as file:
            lines = file.readlines()
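            # Take the text after the first colon. str.lstrip removes a set of
            # characters rather than a prefix, so it is fragile for this purpose.
            timings.append(float(lines[0].split(":", 1)[1]))
            cpu_times.append(float(lines[1].split(":", 1)[1]))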
        
    df = pd.DataFrame({
        'Run': [f'run_{i}' for i in range(n_runs)],
        'Timing': timings,
        'CPU Time': cpu_times
    })

    mean_timing = df['Timing'].mean()
    std_timing = df['Timing'].std()

    return df, mean_timing, std_timing
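
The parser above assumes each `timings.txt` holds a `Real time:` line followed by a `CPU time:` line. Under that assumption, a small sketch of reading the values by label rather than by line position could look like this; `parse_timings` is a hypothetical helper and the sample values are illustrative only.

```python
def parse_timings(text):
    """Map each 'label: value' line to a float, keyed by the label."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unlabeled lines
        label, _, raw = line.partition(":")
        values[label.strip()] = float(raw)
    return values


sample = "Real time: 12.5\nCPU time: 20.25\n"  # illustrative values, not measurements
parsed = parse_timings(sample)
print(parsed["Real time"], parsed["CPU time"])
```

Keying by label means the result does not change if the two lines are ever written in a different order.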

In [11]:
# Get the dataframe for GPU test
df_gpu, mean_timing_gpu, std_timing_gpu = get_mean_and_std_of_times("gpu_test")
print("GPU Test DataFrame:")
print(df_gpu)
print(f"Mean Timing: {mean_timing_gpu}, Std Timing: {std_timing_gpu}")

# Get the dataframe for Baseline test
df_baseline, mean_timing_baseline, std_timing_baseline = get_mean_and_std_of_times("baseline_test")
print("Baseline Test DataFrame:")
print(df_baseline)
print(f"Mean Timing: {mean_timing_baseline}, Std Timing: {std_timing_baseline}")

# print mean speedup
speedup = mean_timing_baseline / mean_timing_gpu
print(f"Speedup: {speedup}")


GPU Test DataFrame:
       Run      Timing    CPU Time
0    run_0  36535.3858  47866.0677
1    run_1  28292.5568  39011.7775
2    run_2  37057.9193  48107.5929
3    run_3  28184.3531  39052.3030
4    run_4  36291.9042  47296.5360
5    run_5  36207.7599  47648.9669
6    run_6  28130.3158  39049.0143
7    run_7  35866.5569  47147.0061
8    run_8  36473.8846  47676.1727
9    run_9  28066.5364  38949.7274
10  run_10  35993.3417  46888.1049
11  run_11  36634.2409  47894.0841
12  run_12  28108.6109  39005.7201
13  run_13  36261.7977  47453.9584
14  run_14  28018.8692  38841.6865
15  run_15  36855.7129  47627.9897
16  run_16  28705.5709  39749.2466
17  run_17  36283.8998  47314.1508
18  run_18  29007.5645  40200.9690
19  run_19  28989.4414  40092.6941
20  run_20  35870.3802  46857.8725
21  run_21  29645.9053  41005.6306
22  run_22  28750.2661  39820.2386
23  run_23  29579.6151  40953.0376
24  run_24  36011.0643  47100.4461
25  run_25  28775.6436  39838.9675
26  run_26  29742.9457  41040.6240


### Strong Scalability Test

In this section, we analyze the execution times for the strong scalability test. We have already submitted jobs for different numbers of nodes and collected the execution times. The results are plotted on a logarithmic scale to better visualize the differences in execution times as the number of nodes increases.

The strong scalability test helps us understand how the execution time decreases as we increase the number of nodes while keeping the problem size constant. Ideally, the execution time should be inversely proportional to the number of nodes (doubling the nodes halves the time), indicating efficient parallelization and resource utilization.
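
To quantify how close a run comes to this ideal, one common choice is the speedup `S(N) = T(1) / T(N)` and the parallel efficiency `E(N) = S(N) / N`, where `T(N)` is the execution time on `N` nodes. The sketch below computes both; the function name is hypothetical and the timings are made-up placeholder values, not measurements from this project.

```python
import numpy as np


def strong_scaling_metrics(nodes, times):
    """Return (speedup, efficiency) relative to the first configuration."""
    nodes = np.asarray(nodes, dtype=float)
    times = np.asarray(times, dtype=float)
    speedup = times[0] / times                 # S(N) = T(first) / T(N)
    efficiency = speedup / (nodes / nodes[0])  # E(N) = S(N) / (N / N_first)
    return speedup, efficiency


placeholder_nodes = [1, 2, 4]
placeholder_times = [100.0, 50.0, 40.0]  # placeholder values, not measurements
speedup, efficiency = strong_scaling_metrics(placeholder_nodes, placeholder_times)
print(speedup, efficiency)
```

An efficiency of 1 corresponds to ideal scaling; values below 1 indicate overhead such as communication or load imbalance.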

In [None]:
execution_times_strong_scaling = []

# Read the recorded execution time for each node count
for num_nodes in strong_scaling_nodes:
    execution_time_file = f"{data_folder}/strong_scaling/run_0/nodes_{num_nodes}/timings.txt"

    with open(execution_time_file, "r") as f:
        line = f.readline().strip()
        execution_time = float(line.replace("Total execution time: ", ""))
    execution_times_strong_scaling.append(execution_time)
    print(f"Execution time for {num_nodes} nodes: {execution_time} seconds")


In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
plt.yscale('log')
plt.plot(strong_scaling_nodes, execution_times_strong_scaling, label='Strong Scaling', marker='o')
plt.xlabel('Number of Nodes')
plt.ylabel('Execution Time (seconds)')
plt.title('Strong Scalability Test')
plt.grid(True)
plt.legend()
plt.savefig(f"{charts_folder}/scalability_plot.png") 
plt.show()

### Weak Scalability Test

In this section, we analyze the execution times for the weak scalability test. The corresponding cells are currently commented out.

The weak scalability test helps us understand how the execution time changes as we increase the number of nodes while keeping the workload per node constant. Ideally, the execution time should remain constant with the increase in the number of nodes, indicating efficient parallelization and resource utilization.
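
For weak scaling, the usual figure of merit is the efficiency `E(N) = T(1) / T(N)`, which stays at 1 when the execution time does not grow with the number of nodes. A minimal sketch follows, with a hypothetical function name and placeholder timings rather than measured ones.

```python
def weak_scaling_efficiency(times):
    """Efficiency relative to the first (smallest) configuration: T(1) / T(N)."""
    baseline = times[0]
    return [baseline / t for t in times]


weak_placeholder_times = [100.0, 105.0, 125.0]  # placeholder values, not measurements
weak_efficiency = weak_scaling_efficiency(weak_placeholder_times)
print(weak_efficiency)
```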

In [None]:
# execution_times_weak_scaling = []

# # Read the recorded execution time for each node count
# for num_nodes in weak_scaling_nodes:
#     execution_time_file = f"{data_folder}/weak_scaling/run_0/nodes_{num_nodes}_forecasters_{weak_scaling_forecasters*num_nodes}/timings.txt"

#     with open(execution_time_file, "r") as f:
#         line = f.readline().strip()
#         execution_time = float(line.replace("Total execution time: ", ""))
#     execution_times_weak_scaling.append(execution_time)

In [None]:

# # Plot the weak scalability results
# plt.figure(figsize=(10, 6))
# plt.plot(weak_scaling_nodes, execution_times_weak_scaling, label='Weak Scaling', marker='o')
# plt.xlabel("Number of Nodes")
# plt.ylabel("Execution Time (seconds)")
# plt.title("Weak Scalability Test")
# plt.grid(True)
# plt.savefig(f"{charts_folder}/weak_scalability.png")
# plt.show()

### Conclusion

In this notebook, we set up a series of performance and scalability tests for an ensemble of forecasters in a high-performance computing environment. We started by importing the necessary libraries and setting up the directory structure. We then defined functions to submit jobs for the ten-node, GPU, baseline, strong scaling, and weak scaling tests.

The timing analysis reads the real and CPU times recorded for each run of the GPU and baseline tests, summarizes them by mean and standard deviation, and computes the mean speedup of the GPU test over the baseline.

The strong scalability test is designed to show how the execution time decreases with an increasing number of nodes for a fixed problem size, and the weak scalability test how it behaves when the workload per node is held constant. In this version of the notebook, the strong scaling cells have not been executed and the weak scaling cells are commented out, so no scaling results are reported here.

Overall, these tests provide valuable insights into the performance and scalability of our forecasting ensemble, helping us optimize and improve our high-performance computing workflows.