# ⭐ Tutorial: High-Performance Computing (HPC) with `parallel_run`

This notebook demonstrates how to use the `RiskLabAI.hpc.parallel_run` utility to significantly speed up your code by executing it on multiple CPUs.

Many tasks in finance, like running a Monte Carlo simulation or backtesting many parameters, are 'embarrassingly parallel'. This means the work can be split into independent jobs and run simultaneously.
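
To make the idea of independent jobs concrete, here is a minimal sketch of a Monte Carlo estimate split into separate jobs. The function name, the 1% daily volatility, and the path counts are hypothetical choices for illustration, not part of RiskLabAI. Each job gets its own child seed from NumPy's `SeedSequence`, so no job depends on another job's state.

```python
import numpy as np

def simulate_terminal_mean(seed, n_paths=2_000, n_steps=252):
    """One independent job: mean terminal value of driftless Gaussian random walks."""
    rng = np.random.default_rng(seed)
    # Each path is a sum of i.i.d. daily shocks (hypothetical 1% daily volatility)
    shocks = rng.normal(loc=0.0, scale=0.01, size=(n_paths, n_steps))
    return float(shocks.sum(axis=1).mean())

# Spawn statistically independent child seeds, one per job
child_seeds = np.random.SeedSequence(42).spawn(4)

# No job reads another job's output, so these calls could run in any order,
# or on separate CPUs, without changing the individual results.
estimates = [simulate_terminal_mean(seed) for seed in child_seeds]
```

Because the jobs share nothing, a loop like this is the kind of workload that can be handed to a parallel runner without restructuring.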

We will:
1.  Define a 'slow' task that simulates a piece of work.
2.  Run it **serially** (on 1 CPU) to get a baseline time.
3.  Run it **in parallel** using `parallel_run` with both `lin_partition=False` (item-by-item) and `lin_partition=True` (chunked).
4.  Compare the execution times.

## 0. Setup and Imports

In [None]:
# Standard Imports
import time
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing

# RiskLabAI Imports
from RiskLabAI.hpc.hpc import parallel_run
import RiskLabAI.utils.publication_plots as pub_plots

# Setup plotting and configuration
pub_plots.setup_publication_style()
N_JOBS = 40 # Total number of jobs to run
SLEEP_TIME = 0.1 # Each job takes 0.1s
jobs_list = list(range(N_JOBS))
N_CPUS = multiprocessing.cpu_count()
print(f"Running {N_JOBS} jobs. Expected serial time: {N_JOBS * SLEEP_TIME:.1f}s")
print(f"Using {N_CPUS} CPUs for parallel execution.")

## 1. The Slow Way: Serial Execution

First, we define our simple task and run it in a standard `for` loop. This will be our benchmark.

In [None]:
def simple_task(item):
    """A simple task that simulates 0.1s of work."""
    time.sleep(SLEEP_TIME)
    return item * 2


print("Running serially (1 CPU)...")
start_time_serial = time.perf_counter()  # monotonic, high-resolution timer for benchmarking
results_serial = [simple_task(job) for job in jobs_list]
end_time_serial = time.perf_counter()

serial_time = end_time_serial - start_time_serial
print(f"Serial execution took: {serial_time:.2f} seconds")

## 2. The Fast Way: Parallel (Item-by-Item)

Now we use `parallel_run` with `lin_partition=False`. 

The `parallel_run` function handles all the logic: it dispatches one job to each available CPU, waits for it to finish, and dispatches the next, until all jobs are done. Notice that our `simple_task` function is the *exact same* as in the serial version.
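
The dispatch pattern described above can be sketched with the standard library's `concurrent.futures`. This is only an illustration of the idea, not how `parallel_run` is implemented. `demo_task` and `demo_jobs` are hypothetical stand-ins. Threads are enough here because `time.sleep` releases the GIL; CPU-bound work would need processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def demo_task(item):
    """Hypothetical stand-in for simple_task: a short sleep, then doubling."""
    time.sleep(0.05)
    return item * 2

demo_jobs = list(range(8))

# executor.map hands items to idle workers one at a time and
# returns results in input order, regardless of completion order.
with ThreadPoolExecutor(max_workers=4) as executor:
    demo_results = list(executor.map(demo_task, demo_jobs))

print(demo_results)
```

As in the serial version, the task function takes a single item and knows nothing about how it is scheduled.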

In [None]:
print(f"Running in parallel ({N_CPUS} CPUs, item-by-item)...")
start_time_parallel = time.perf_counter()  # monotonic, high-resolution timer for benchmarking

results_parallel = parallel_run(
    simple_task,
    jobs_list,
    lin_partition=False
)

end_time_parallel = time.perf_counter()
parallel_time = end_time_parallel - start_time_parallel

print(f"Parallel (item-by-item) execution took: {parallel_time:.2f} seconds")
print(f"Results match: {results_serial == results_parallel}")

## 3. The 'Chunked' Method (`lin_partition=True`)

This is the second mode, and the default in the original implementation. Here, `parallel_run` splits the list of 40 jobs into `N_CPUS` chunks.

This requires us to write a *different* target function (`chunked_task`) that is designed to receive a *list of indices* (e.g., `[0, 1, 2, 3]`) and loop over them itself.
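
To show what splitting into chunks of indices can look like, here is a minimal sketch of a linear partition built with `np.linspace`. The helper name `linear_partitions` is hypothetical, and this is an assumption about the general technique, not a copy of what `parallel_run` does internally.

```python
import numpy as np

def linear_partitions(n_items, n_chunks):
    """Return boundaries that split range(n_items) into near-equal contiguous chunks."""
    n_chunks = min(n_chunks, n_items)  # never create more chunks than items
    edges = np.linspace(0, n_items, n_chunks + 1)
    return np.ceil(edges).astype(int)

# Example: 40 jobs on a hypothetical 4-CPU machine
edges = linear_partitions(40, 4)
chunks = [list(range(edges[i], edges[i + 1])) for i in range(len(edges) - 1)]

for chunk in chunks:
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} indices)")
```

Each chunk is a contiguous list of indices, which is the kind of argument a chunk-aware target function has to accept and loop over.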

In [None]:
def chunked_task(indices):
    """
    A task designed for lin_partition=True.
    It receives a list of *indices* and must process them.
    """
    local_results = []
    for idx in indices:
        # We get the item from the global 'jobs_list'
        item = jobs_list[idx]
        result = simple_task(item) # Re-use the 0.1s sleep
        local_results.append(result)
    return local_results


print(f"Running in parallel ({N_CPUS} CPUs, by chunk)...")
start_time_chunked = time.perf_counter()  # monotonic, high-resolution timer for benchmarking

results_chunked = parallel_run(
    chunked_task,
    jobs_list, # 'jobs_list' is only used for its length here
    lin_partition=True
)

end_time_chunked = time.perf_counter()
chunked_time = end_time_chunked - start_time_chunked

print(f"Chunked parallel execution took: {chunked_time:.2f} seconds")
print(f"Results match: {results_serial == results_chunked}")

## 4. Performance Comparison

Finally, let's plot the results. Both parallel methods should be significantly faster than the serial method. The small difference between the two parallel methods is due to `joblib`'s overhead.
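
As a rough sanity check on what to expect, the ideal timings can be worked out by hand. The sketch below assumes equal-length jobs and zero scheduling overhead. The helper name and the worker count of 8 are hypothetical; substitute your own `N_CPUS`.

```python
import math

def ideal_parallel_time(n_jobs, n_workers, seconds_per_job):
    """Lower bound on wall time when jobs are equal-length and overhead is zero."""
    rounds = math.ceil(n_jobs / n_workers)  # the busiest worker handles this many jobs
    return rounds * seconds_per_job

serial = 40 * 0.1                            # 40 jobs of 0.1 s each, as in this notebook
parallel = ideal_parallel_time(40, 8, 0.1)   # 8 workers is a hypothetical CPU count
speedup = serial / parallel

print(f"ideal parallel time: {parallel:.2f}s, ideal speedup: {speedup:.1f}x")
```

Measured times will sit somewhat above this bound, since starting workers and passing data between them is not free.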

In [None]:
times = [serial_time, parallel_time, chunked_time]
labels = ['Serial (1 CPU)', 'Parallel (Item-by-Item)', 'Parallel (Chunked)']
colors = ['#d11141', '#00aedb', '#00b159']

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(labels, times, color=colors)
ax.bar_label(bars, fmt='%.2fs')

pub_plots.apply_plot_style(
    ax,
    title='Parallel vs. Serial Execution Time',
    xlabel='Execution Method',
    ylabel='Time Taken (seconds)'
)
ax.grid(False, axis='x') # Explicitly turn off vertical grid lines (a bare ax.grid(axis='x') only toggles them)
plt.tight_layout()
plt.show()