<a href="https://colab.research.google.com/github/Thomas-Fabbris/parallel-computing-polimi/blob/main/OPENMP/OpenMP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **OpenMP**
A macro-based approach for expressing thread level parallelism in C

## **Setup**

In [1]:
%%capture
!apt install build-essential libomp-dev
!mkdir /home/OpenMP
%cd /home/OpenMP

## **Glossary**

OpenMP hides from the programmer all the pedantic complexity of managing POSIX threads, deferring that to the compiler and exposing a simple set of directives/pragmas to specify how the code needs to be parallelized.
For instance, in the OpenMP model, the fork and join operations happen automatically at the stard and end of parallel constructs.
Similarly, using locks just requires denoting the critical sections of the code that must be protected by one.

Some good doc can be found here: https://rookiehpc.org/openmp/docs/index.html
<br>
Another good introduction is this on: https://github-pages.ucl.ac.uk/research-computing-with-cpp/08openmp/02_intro_openmp.html

### Pragma Syntax

OpenMP directives in C/C++ use the following general syntax (square brackets denote optional parts):

```
#pragma omp <directive> [clause[[,] clause] ...]
```

Clauses that refine the behavior of the directive to which they are applied.

Common clauses:

- `num_threads(n)` : specify the number of threads to use.
- `nowait` : remove the implicit barrier at the end of a construct.
- `if(cond)` : execute in parallel only if the condition is true.

Data sharing clauses:

- `private(varlist)` : each thread has its own uninitialized copy of listed variables.
- `firstprivate(varlist)` : like `private`, but each copy is initialized with the original value.
- `lastprivate(varlist)` : copies the value from the last iteration or section (lexicographically - w.r.t. the order as written in the code) back to the original variable; a very simple way to avoid concurrent writes.
- `shared(varlist)` : variables are shared among all threads.
- `reduction(operator : varlist)` : perform a reduction operation across threads on the given variables.
- `default(shared|private|firstprivate|lastprivate|none)` : defines the default sharing type for variables in the region; if not specified, it is `shared`; if set to `none` the compiler forces you to manually specify a sharing clause for each variable access by the thread.

Some pragmas are declarative and can sit anywhere in the serial part of the code (e.g. `omp threadprivate`), others, like worksharing ones (e.g. `omp section`), act on the block that immediately follows.
Thus, if such a pragma preceeds a control statement, it acts on its code block, otherwise a block can be induced manually with a pair of `{}`.

*Note: in C/C++ a **structured block**, is defined as a single statement or a sequence of statements that is enclosed in curly braces. It shall be such that execution may never branch into the sequence or out of it, going through it entirely after entering it. This is sometimes called "Single Entry, Single Exit" (SESE). A multi-statement block often coincides with a scope.*

Example:
```
int a = 1, b, c, d, s;
#pragma omp parallel for default(private) firstprivate(a, b) lastprivate(c) shared(s) num_threads(10)
for (int i = 0; i < 10; ++i) {
  a += 1; // everyone increments their 'a' copy from 1 to 2
  c = i;  // the master thread's 'c' will be 9 after the loop
  d = 4;  // everyone initializes its copy of 'd' to 4
  s = i;  // race condition for who will write the final 's'
}
```

Threadprivate variables:

- Declared with: `#pragma omp threadprivate(varlist)`
- Define global or file-scope variables private to each thread across parallel regions.
- Unlike `private`, their value persists across parallel regions.
- Use the `copyin(varlist)` clause to initialize threadprivate variables in each thread from the master thread.

---

### Common Pragmas

The key parallel constructs:

- `#pragma omp parallel` : start a parallel region executed by a team of multiple threads.

A team is a group of threads that work together to execute a region of code.
Each team has one master thread (the thread that encounters the parallel directive) and zero or more additional worker threads.
Within a team, threads can execute work concurrently using work-sharing directives or further nested parallel constructs.
When the parallel region ends, the team disbands, and execution continues with a single thread (the master).

Work-sharing directives:

- `#pragma omp for` : distribute loop iterations among threads in a parallel region (\*). <!--the `omp loop` pragma is its modern, more flexible, counterpart-->
- `#pragma omp sections` / `#pragma omp section` : divide work into distinct code blocks, each executed by one thread in a parallel region.

These can be combined with `omp parallel` to write just one pragma instead of two to do the same thing, like `omp parallel for` to simultaneously open a parallel region and parallelize a loop.

These pragmas terminate in an **implicit barrier** that waits for all threads to complete the work they were assigned as part of the pragma.

The parallel directive creates **one** team of threads.
Remember, **you can't "divide twice" within one team**, therefore you can't nest work-sharing directives inside the same parallel region, you can only issue them sequentially unless you create a nested parallel region that spawns additional teams.
<br>
In other words, a team can only deal with one work-sharing directive at once.

**Nesting parallel regions** provides an immediate way to allow more threads to participate in the computation.<br>
Nested parallel region behavior:
- if nested parallelism is enabled (see later), each nested region will spawn its defined number of threads each time it is encountered by any thread. This can help programs with limited scalability, but also quickly blow up the number of threads past physical cores and bloat the system (oversubscription) if abused. Enabling dynamic adjustment of the number of threads (see later) can mitigate this.
- if nested parallelism is disabled, the nested parallel region executes as if it were a serial block.

Synchronization and exclusion:

- `#pragma omp single` : only one thread executes the block; other threads wait at an implicit barrier.
- `#pragma omp master` : executed only by the master thread (thread 0), with no barrier implied.
- `#pragma omp barrier` : explicit synchronization point; all threads wait here.
- `#pragma omp critical [(name)]` : only one thread executes the block at a time; multiple critical blocks that have the same name are seen as the same block and use the same lock underneath.
- `#pragma omp atomic` : perform a single atomic update on a shared variable.
- `#pragma omp flush` : enforce memory consistency (ensures all threads see updated values).
<!--- `#pragma omp ordered` : enforce ordered execution of certain loop parts marked as `ordered`.-->

(\*) Using `omp for` requires a canonical OpenMP loop, meaning that:
- it is strictly a `for` loop;
- it has a single integer loop induction variable;
- the loop is countable (finite), with a linear increment or decrement (e.g. `i += 2` is ok, but not `i *= 2`);
- the loop variable, bounds, and increment are iteration invariants;

Following from the above, the number of iterations can be determined before the loop executes, even if it's not known until run time.
In particular, note that the number of iterations doesn't need to be known statially (at compile time). It can be computed at runtime so long as it is fixed by the time the loop is reached and needs to be divided among threads.

---

### Scheduling

When parallelizing unbalanced loops, where some iterations may take more time than others, we can balance the load between threads with a schedule:

- `schedule(kind[, chunk_size])` : control iteration scheduling, `kind` can be:
  - `static` : iterations are split into chunks of equal size, each thread is assigned its almost evenly chunks before execution begins and those never change afterwards.
  - `dynamic` : iterations are split into chunks of equal size and placed in a queue, threads are assigned one chunk at a time from the queue. Adds a slight scheduling overhead, use only if the workload is truly unbalanced.
  - `guided` : chunks of iterations are assigned to threads as the loop runs, but the chunk size decreases as the loop progresses. The given `chunk_size` functions as the minimum chunk size reached. Also adds a slight overhead, but less than `dynamic` due to fewer scheduling decisions as per the larger initial chunks.
  - `runtime` : delegates the choice of schedule to the environment variable `OMP_SCHEDULE`.
  - `auto` : the choice of schedule is delegated to either the compiler or the runtime environment.
  - if not specified, the default schedule is implementation-dependent.
- `collapse(n)` : collapse (flatten) `n` nested loops into a single loop (and thus single iteration space) for scheduling.

**Keep this in mind when parallelizing nested loops:**<br>
If execution of any associated loop changes any of the values used to compute any of the **iteration counts** (loop bounds), then the behavior is unspecified.

---

### Tasks and Related Pragmas

Defines asynchronous units of work for fine-grained parallelism:

- `#pragma omp task` : define a task for deferred execution.
- `#pragma omp taskwait` : wait until all child tasks of the current task complete.
- `#pragma omp taskgroup` : group tasks for collective synchronization.
- `#pragma omp taskyield` : allow a thread to yield execution to other tasks.

Common clauses for tasks:

- `if(cond)` : create the task only if the condition is true; otherwise, execute it immediately.
- `final(cond)` : mark task as “final,” disallowing creation of child tasks inside it.
- `mergeable` : allow the task to be merged with its parent task for optimization.
- `depend(in|out|inout : varlist)` : declare task dependencies to control execution order.
- `untied` : allow the task to resume on a different thread than the one that started it.

Whereas sections define static tasks, statically defined at compile time and whose number cannot change, that are queued up and handled in arbitrary order by threads, these true **tasks** form a graph of execution with dependencies, any task spawning more tasks, and synchronization, more like threads would in a barebones fork-join model, but with cleaner code and higher level abstractions.
<br>
In brief: tasks can be spawned from any point and thread inside the parallel region!

A huge warning: tasks are tied to their thread by default, meaning that if they are suspended, they must resume in the same thread until they finish. This is crucial because private variables (e.g. those specified on the `parallel private(...)` that spawns the team of threads running the tasks) are **per-thread, not per-task**, so if a task relies on the private variables of its thread, and is untied, it may see those randomly changing if it ever gets rescheduled. And the same applies to other per-thread things, like IDs.
<br>
If you want to untie a task, make sure it only works on shared variables or its own local variables.

---

### Teams and Offloading (extra)

The `pragma opm teams` directive was introduced mainly to support heterogeneous (accelerator/GPU) programming.

It takes the place of `parallel`, but unlike it, `teams` creates multiple teams of threads, each with its own master and workers.
Each team executes the same code region independently.
Within each team, you can then launch further nested parallelism (using `omp parallel`) to create hierarchical parallelism.

On a CPU, this may just create a single team (depending on implementation), but on GPUs, **it naturally maps to CUDA thread and blocks** seen as multiple independent teams, each with their own threads.

Offloading may look like this:
```
#pragma omp target teams distribute parallel for device(device_num)
for (int i = 0; i < N; ++i) {
    A[i] = B[i] + C[i];
}
```

Where the pragma reads as:
- target : offload to a set device (e.g. GPU).
- teams : create multiple teams (like CUDA thread blocks).
- parallel for : within each team, create multiple threads to execute parts of the loop in parallel.

---

### Routines

Functions exposed by the OpenMP API:

- `int omp_get_thread_num()` : use inside a parallel region to get the unique identifier of your thread; it will be a number in [0, num_threads).
- `int omp_get_num_threads()` : returns the number of threads created in the current parallel region.
- `void omp_set_num_threads(int num_threads)` : sets the default number of threads to use in parallel regions.
- `int omp_get_max_threads()` : returns the maximum number of threads to use in parallel regions.
- `void omp_set_max_threads(int num_threads)` : sets the maximum number of threads to use in parallel regions.
- `void omp_set_nested(int nested)` : enables or disables nested parallelism.
- `void omp_set_dynamic(int dynamic_threads)` : enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions.
- `double omp_get_wtime()` : returns an absolute time reference, useful to time code execution.

Additional routines are available inside the `teams` pragma:
- `int omp_get_team_num()` : returns the number of created teams.
- `int omp_get_team_num()` : returns the unique identifier of the caller thread's team; it will be a number in [0, num_teams).

To access those you need to include the `omp.h` header.

---

### Environment Variables

Control OpenMP runtime behavior without recompiling.

- `OMP_NUM_THREADS` : default number of threads to use in parallel regions, if not specified defaults to the system's core count.
- `OMP_SCHEDULE` : default loop scheduling policy (e.g. `"dynamic,4"`).
- `OMP_PROC_BIND` : control thread-core binding (`master|close|spread`).
- `OMP_PLACES` : specify hardware places (`threads|cores|sockets` or custom lists).
- `OMP_MAX_ACTIVE_LEVELS` : maximum depth of nested parallel regions.
- `OMP_WAIT_POLICY` : set thread waiting behavior (`active` or `passive`).
- `OMP_DISPLAY_ENV` : print the current OpenMP environment at startup.
- `OMP_STACKSIZE` : set the thread stack size.
- `OMP_CANCELLATION` : enable or disable cancellation features in tasks or loops.
- `OMP_NESTED` : enables nested parallel regions, usually disabled by default.
- `OMP_DYNAMIC` : enables or disables dynamic adjustment of the number of threads available for the execution of subsequent parallel regions, usually disabled by default.(`TRUE` or `FALSE`).

---

### Controlling Affinity and Thread Assignment

Mechanisms to control how threads are bound to CPU cores and how their placement affects performance:

- `proc_bind(master|close|spread)` : directive clause controlling how threads are distributed within the available places (as defined by `OMP_PLACES`):
  - `master` : all threads are placed close to the master thread (usually on the same core group, e.g. socket or NUMA node).
  - `close` : threads are packed as near as possible to each other, filling one place before moving to the next (minimize distance, maximize data locality).
  - `spread` : threads are distributed as widely as possible across places (maximize distance, maximize resource usage).

At the environment level:

- `OMP_PROC_BIND` : controls whether and how threads are bound (`TRUE|FALSE|master|close|spread`).
- `OMP_PLACES` : specifies the hardware resources threads may be placed on (e.g. `threads`, `cores`, `sockets`, or custom lists like `"{0,1},{2,3}"`).

Typical uses:
- use `close` for threads that frequently share many accesses to the same data or work on contiguous chunks of a shared array, thus improving cache locality.
- use `spread` for threads that are largely independent or memory-bound, hence prefer having a lot of hardware resources and may as well fill a cache line by themselves with little data accesses in common with each other.
- with nested parallel regions, it's usual to have first a `spread` (to use different sockets or cores) and then a `close` binding (to exploit data reuse within a place), especially when inner threads see more data reuse opportunities than outer ones.

Example:

```
#pragma omp parallel num_threads(4) proc_bind(spread)
{
  #pragma omp parallel num_threads(4) proc_bind(close)
  {
    // Work here
  }
}
````

---

### Settings Precedence

When setting the same parameter through different means, they override each other in this order from most to least authoritative:

1. Explicit clauses in pragmas (`proc_bind`, `num_threads`, ...)
2. Explicit routine calls (`omp_set_num_threads`, ...)
3. Environment variables (`OMP_PROC_BIND`, `OMP_PLACES`, `OMP_NUM_THREADS`, ...)
4. Implementation defaults (compiler/runtime)

---

### Compiler Commands

Compilers (GCC, Clang) require a flag to enable OpenMP support:

```bash
gcc/clang -fopenmp program.c -o program
```

To disable OpenMP (e.g. for debugging), you can just omit `-fopenmp`.
<br>
When disabled, OpenMP pragmas are ignored, and the code runs in serial mode.
This is useful for checking correctness and debugging race conditions.

---

### Notes

* OpenMP pragmas are hints to the compiler: if support is disabled, they are ignored and the program remains valid C/C++.
* For portable and deterministic parallel programs, always explicitly specify data-sharing attributes and scheduling.

## **Simple Examples**

### **Exercise 1**

Multiple concurrent threads printing "Hello, World!" (plus meaningless computation to make it run longer):

In [8]:
%%writefile /home/OpenMP/hello_world_0.cpp
#include <stdio.h>
#include <omp.h>

int main ()
{
  #pragma omp parallel /*num_threads(2000)*/
  {
    int id = omp_get_thread_num();
    printf("Hello World from thread = %d\n", id);
  }
}

Overwriting /home/OpenMP/hello_world_0.cpp


Note how this is different from the Pthreads implementation. Questions:
- How many threads will be created? Usually as many as the available CPU cores, but this is implementation dependent. <br/> In this example, two threads are created as the default Colab runtime features a dual-core CPU.
- What happens if you ask for a number of threads greater than the number of cores? Oversubscription may happen, leading to lower performance and introducing a scheduling overhead.

Compile:

In [9]:
!g++ hello_world_0.cpp -fopenmp -o hello_world_0

Execute:

In [10]:
!./hello_world_0

Hello World from thread = 1
Hello World from thread = 0


### **Exercise 2**

Introducing parallelism in OpenMP can be as easy as adding pragmas, with no further modifications on the code. For example:

In [11]:
%%writefile /home/OpenMP/hello_world.cpp

#include <stdio.h>

void print_message(int threadIndex) {
  printf("Thread number %d\n", threadIndex);
}

int main() {
  #pragma omp parallel num_threads(4)
  {
    #pragma omp for schedule(static, 4)
    for (int ii = 0; ii < 10; ii++) {
      print_message(ii);
    }
  }
  return 0;
}

Writing /home/OpenMP/hello_world.cpp


For this example we need clang if we want to inspect the LLVM-IR...

In [None]:
%%capture
!apt install clang

Compile without the OpenMP flag:

In [None]:
%cd /home/OpenMP
!clang hello_world.cpp -o hello_world

/home/OpenMP


Inspect the generated LLVM IR:

In [None]:
%cd /home/OpenMP
!clang hello_world.cpp -S -emit-llvm
!cat hello_world.ll

/home/OpenMP
; ModuleID = 'hello_world.cpp'
source_filename = "hello_world.cpp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [18 x i8] c"Thread number %d\0A\00", align 1

; Function Attrs: mustprogress noinline optnone uwtable
define dso_local void @_Z13print_messagei(i32 noundef %0) #0 {
  %2 = alloca i32, align 4
  store i32 %0, i32* %2, align 4
  %3 = load i32, i32* %2, align 4
  %4 = call i32 (i8*, ...) @printf(i8* noundef getelementptr inbounds ([18 x i8], [18 x i8]* @.str, i64 0, i64 0), i32 noundef %3)
  ret void
}

declare i32 @printf(i8* noundef, ...) #1

; Function Attrs: mustprogress noinline norecurse optnone uwtable
define dso_local noundef i32 @main() #2 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  store i32 0, i32* %1, align 4
  store i32 0, i32* %2, align 4
  br label %3

3:                                                ; preds = %8,

Compile with the OpenMP flag:

In [None]:
%cd /home/OpenMP
!clang hello_world.cpp -fopenmp -lstdc++ -o hello_world

/home/OpenMP


Inspect the generated LLVM IR:

In [None]:
%cd /home/OpenMP
!clang hello_world.cpp -S -emit-llvm -fopenmp
!cat hello_world.ll

/home/OpenMP
; ModuleID = 'hello_world.cpp'
source_filename = "hello_world.cpp"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

%struct.ident_t = type { i32, i32, i32, i32, i8* }

$__clang_call_terminate = comdat any

@.str = private unnamed_addr constant [18 x i8] c"Thread number %d\0A\00", align 1
@0 = private unnamed_addr constant [23 x i8] c";unknown;unknown;0;0;;\00", align 1
@1 = private unnamed_addr constant %struct.ident_t { i32 0, i32 514, i32 0, i32 22, i8* getelementptr inbounds ([23 x i8], [23 x i8]* @0, i32 0, i32 0) }, align 8
@2 = private unnamed_addr constant %struct.ident_t { i32 0, i32 66, i32 0, i32 22, i8* getelementptr inbounds ([23 x i8], [23 x i8]* @0, i32 0, i32 0) }, align 8
@3 = private unnamed_addr constant %struct.ident_t { i32 0, i32 2, i32 0, i32 22, i8* getelementptr inbounds ([23 x i8], [23 x i8]* @0, i32 0, i32 0) }, align 8

; Function Attrs: mustprogress noinline optno

Execute:

In [None]:
%cd /home/OpenMP
!./hello_world

/home/OpenMP
Thread number 0
Thread number 1
Thread number 8
Thread number 9
Thread number 4
Thread number 5
Thread number 6
Thread number 7
Thread number 2
Thread number 3


## **Calculation of pi**

###**Exercise 3**

Integral-based method to calculate pi: each thread calculates the heigth of a set of rectangles (map/SIMD pattern), the sum of all heigths is multiplied by the step size to get the area.
<img align="middle" src="https://drive.google.com/uc?id=17dBhvYY9F5Bl2re_pnmRWiZ717jolCPg">

Why does this work?
<br>
Recall that: $\frac{d}{dx} arctan(x) = \frac{1}{1+x^2}$ and $arctan(0) = 0°$ while $arctan(1) = 45° = \frac{180°}{4}$, so...

*More info here: https://math.stackexchange.com/questions/1085653/geometrical-interpretation-of-pi-int-01-frac41x2dx*

Basic implementation with manual work-sharing using thread IDs:

In [None]:
%%writefile /home/OpenMP/integralpi.cpp
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 4

static long num_steps = 100000000;
double step;

int main() {
	int i, j;
	double pi, full_sum = 0.0;
	double start_time, run_time;
	double sum[MAX_THREADS];

	step = 1.0/(double) num_steps;

	// measure scalability from 1 to MAX_THREADS threads
	for (j = 1; j <= MAX_THREADS; j++) {
		omp_set_num_threads(j);
		full_sum = 0.0;
		start_time = omp_get_wtime();

		#pragma omp parallel
		{
			int i;
			int id = omp_get_thread_num();
			int numthreads = omp_get_num_threads();
			double x;
			sum[id] = 0.0;
			if (id == 0)
				printf(" num_threads = %d", numthreads);

			// manual work allocation
			for (i = id; i < num_steps; i += numthreads) {
				x = (i + 0.5)*step;
				sum[id] = sum[id] + 4.0/(1.0 + x*x);
			}
		}

		for(full_sum = 0.0, i = 0; i < j; i++)
			full_sum += sum[i];

		pi = step * full_sum;
		run_time = omp_get_wtime() - start_time;
		printf("\n pi is %f in %f seconds %d threads \n", pi, run_time, j);
	}
}

Writing /home/OpenMP/integralpi.cpp


Compile and run:

In [None]:
!g++ integralpi.cpp -fopenmp -o integralpi
!./integralpi

 num_threads = 1
 pi is 3.141593 in 0.580693 seconds 1 threads 
 num_threads = 2
 pi is 3.141593 in 0.492389 seconds 2 threads 
 num_threads = 3
 pi is 3.141593 in 0.537672 seconds 3 threads 
 num_threads = 4
 pi is 3.141593 in 0.530303 seconds 4 threads 


Questions:
- How is work distributed among threads?
<!--using thread IDs, each step every "numthreads" is assigned to a different thread-->
- Is the result deterministic?
<!--yes-->
- How do you expect performance to scale with the number of threads?
<!--ideally, linearly, as we will almost always have far more iterations to distribute than threads and there is little overhead for the creation of additional threads, aside from their creation itself; on Colab tho, anything could happen-->

Implementation with the `parallel for` work-sharing construct:

In [None]:
%%writefile /home/OpenMP/integralpi2.cpp
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000000;
double step;

int main() {
	int i, j;
	double x, pi, sum = 0.0;
	double start_time, run_time;

	step = 1.0/(double) num_steps;

	for (j = 1; j <= 4; j++) {
		sum = 0.0;
		omp_set_num_threads(i);
		start_time = omp_get_wtime();

		#pragma omp parallel for private(x) reduction(+:sum)
		for (i = 1; i <= num_steps; i++) {
			x = (i-0.5)*step;
			sum = sum + 4.0/(1.0+x*x);
		}

		pi = step * sum;
		run_time = omp_get_wtime() - start_time;
		printf("\n pi is %f in %f seconds and %d threads\n", pi, run_time, j);
	}
}

Overwriting /home/OpenMP/integralpi2.cpp


Compile and run:

In [None]:
!g++ integralpi2.cpp -fopenmp -o integralpi2
!./integralpi2


 pi is 3.141593 in 0.475248 seconds and 1 threads

 pi is 3.141593 in 0.346572 seconds and 2 threads

 pi is 3.141593 in 0.350604 seconds and 3 threads

 pi is 3.141593 in 0.364947 seconds and 4 threads


Questions:
- How is work distributed among threads?
<!--each thread is statically assigned some of the loop's iterations by the "for" pragma-->
- Are there other ways of resolving the access to the shared variable?
<!--yes, atomically incrementing "sum", even tho it's a poor idea performance-wise-->

## **Variables Initialization**

### **FirstPrivate**

Whenever we need to quickly give a copy of a value to each thread, we use `firstprivate`, this saves us the time needed for each thread to go and fetch a copy of an otherwise shared variable.

Say that we have an array of elements and an initial value.
We need to find subsequences of 3 contigous values in the array that such that, if added to the initial one, overflow.
We can dispatch the initial value to threads as a firstprivate variable.

In [None]:
%%writefile /home/OpenMP/fistprivate.cpp
#include <stdio.h>
#include <omp.h>
#include <limits.h>

int main(void) {
  const int N = 12;
  unsigned int data[N] = {10, 20, 30, 250, 5, 10, 100, 200, 50, 90, 200, 40};
  unsigned int init_value = 1<<30;

  #pragma omp parallel for firstprivate(init_value)
  for (int i = 0; i < N - 2; i++) {
    unsigned int a = data[i];
    unsigned int b = data[i + 1];
    unsigned int c = data[i + 2];

    unsigned int sum = init_value;
    int overflow = 0;

    if (sum > UINT_MAX - a) overflow = 1;
    else sum += a;
    if (!overflow && sum > UINT_MAX - b) overflow = 1;
    else sum += b;
    if (!overflow && sum > UINT_MAX - c) overflow = 1;
    else sum += c;

    if (overflow)
      printf("Thread %d found overflow at subsequence [%d,%d,%d]\n", omp_get_thread_num(), i, i+1, i+2);
  }
}

Questions:
- could we do this without `firstprivate`?
<!--obviously yes, we could just have each thread copy the content of the then-shared "init_value" into its own local variable-->
- what is the advantage of using `firstprivate`?
<!--firstprivate guarantees that threads will not alter the global instance of the variable. It is mainly a semantical tool to ensure that when comparing the code between a parallel and serial execution the existence of threads doesn't unpredictably change the content of the thus-private variable. Ultimately, firstprivate clearly explicitate how the program intends the variable to be (safely) operated upon by threads: each thread receives its own copy of this initial value.-->

### **LastPrivate**

With sections (or iterations of a for loop) and `lastprivate`, we can ensure that a variable is updated by the last section (or iterations) in serial program order, not whichever finishes last in real time.

Arguably, lastprivate is very rarely useful...

In [None]:
%%writefile /home/OpenMP/lastprivate.cpp
#include <stdio.h>
#include <omp.h>

int main() {
  int x = 0;

  #pragma omp parallel sections lastprivate(x)
  {
    #pragma omp section
    { x = 1; printf("Section 1: x=%d\n", x); }

    #pragma omp section
    { x = 2; printf("Section 2: x=%d\n", x); }

    #pragma omp section
    { x = 3; printf("Section 3: x=%d\n", x); }
  }

  printf("After sections, x=%d (from the *last* section)\n", x);
  return 0;
}

## **Linked List Traversal**

###**Exercise 4**

Parallelizing the traversal of a linked list with OpenMP can be highly inefficient.
<img align="middle" src="https://drive.google.com/uc?id=1BrtuiwIzR2Y-xGPUt3lIX88zEfVF7e_L">
The *task* construct provides a better way to dynamically create concurrent work units.

In [None]:
%%writefile /home/OpenMP/linkedlist.cpp
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>

struct node {
  int data;
  int fibdata;
  struct node* next;
};

struct node* init_list(struct node* p);
void processwork(struct node* p);
int fib(int n);

int fib(int n) {
  int x, y;
  if (n < 2) {
    return (n);
  } else {
    x = fib(n - 1);
    y = fib(n - 2);
    return (x + y);
  }
}

void processwork(struct node* p) {
  int n, temp;
  n = p->data;
  temp = fib(n);
  p->fibdata = temp;
}

struct node* init_list(struct node* p) {
  int i;
  struct node* head = NULL;
  struct node* temp = NULL;

  head = malloc(sizeof(struct node));
  p = head;
  p->data = 38;
  p->fibdata = 0;
  for (i = 0; i < 5; i++) {
    temp  = malloc(sizeof(struct node));
    p->next = temp;
    p = temp;
    p->data = 38 + i + 1;
    p->fibdata = i + 1;
  }
  p->next = NULL;
  return head;
}

int main() {
  double start, end;
  struct node *p=NULL;
  struct node *temp=NULL;
  struct node *head=NULL;

  printf("Process linked list\n");
  printf("  Each linked list node will be processed by function 'processwork()'\n");
  printf("  Each node will compute a subsequent Fibonacci number starting from the 38th\n");

  p = init_list(p);
  head = p;

  start = omp_get_wtime();

  #pragma omp parallel num_threads(4)
  {
    #pragma omp master
    printf("Threads: %d\n", omp_get_num_threads());
    #pragma omp single
    {
      printf("I am thread %d and I am creating tasks\n", omp_get_thread_num());
      p = head;
      while (p) {
        #pragma omp task firstprivate(p) // each task gets is own copy of the pointer to the current list node
        {
          processwork(p);
          printf("I am thread %d\n", omp_get_thread_num());
        }
        p = p->next;
      }
    }
  }

  end = omp_get_wtime();
  p = head;
  while (p != NULL) {
    printf("%d : %d\n",p->data, p->fibdata);
    temp = p->next;
    free (p);
    p = temp;
  }
  free (p);

  printf("Compute Time: %f seconds\n", end - start);
}

Questions:
- How many threads create tasks?
<!--one, it's inside the "single" pragma-->
- How are tasks distributed among threads?
<!--workpile-style, each thread fetches a new task every time it finishes the current one-->

## **Task Graphs**

Recall:
- $work$ : total amount of work
- $span$ : work on the critical path
- $parallelism = work / span$

Room for optimization:
- reducing the critical path
- reducing overhead for anything that is not on the critical path

Representation:
- nodes: tasks with a certain amount of work to do
- directed edges: dependencies between tasks (inbound arrows: what the current task needs to wait for)

For completeness, let's also recap a bit of terminology:
- when a task creates another task, the creating task becomes the *parent task* of the new task. The new task then is called a *child task* of its parent task. - the term *sibling tasks* refers to all tasks that have the same parent.
- a *descendant task* is a task in the ancestor chain of a parent, so either a child task or a task created by a descendant task (e.g., a child task of a child task).
- if a new task is put into the task pool, it is said to be *deferred* while if it is executed straight away, it is *undeferred*.
- a task is described as *completed* when it has been scheduled for execution and that execution has finished.

### **Exercise 5**

Given the following task graph:

&nbsp;

<img align="center" src="https://drive.google.com/uc?id=17PNFB2oQAFEHfvQPflSieUmjQpmLGAmn">

&nbsp;

- Calculate work and parallelism. <!--span=100+250+300, work=860, parallelism=1.32-->
- Write an OpenMP implementation reflecting the structure of the task graph.
- How many threads are active during the execution of Task 5? <!--1 or 2, depending on the state of T4-->
- Is there a better parallel implementation (considering both performance and resource usage)? I.e. do we really need to exploit all parallelism? <!--since W(T4) > W(T2) + W(T3), there is no need to run T2 and T3 in parallel (with the overhead of constructing one more thread), we could just run the sequentially T2 -> T3 and we would still be bond by the amount of work done by T4-->

### **Exercise 6**

Given the following task graph:

&nbsp;

<img align="center" src="https://drive.google.com/uc?id=15gkHj2zWAZQnuLhmPMWWSjWZqOfkj3MB">

&nbsp;

- Calculate work and span. <!--span=670, work=1045, parallelism=1.56-->
- Write an OpenMP implementation reflecting the structure of the task graph.
- Is this implementation faster than a sequential one? <!--yes, we have parallelism > 1 after all-->


### **Exercise 7**

Given the following task graph:

&nbsp;

<img align="center" src="https://drive.google.com/uc?id=1PXlO9Eaxu1lI27k2hYIaWD6y6tLZqDi9">

&nbsp;

Assume W(T1)=100, W(T2)=100, W(T3)=75, W(T4)=50, W(T5)=75, W(T6)=100, W(T7)=100, W(T8)=200.

- Calculate work and span. <!--span=400, work=800, parallelism=2-->
- Write an OpenMP implementation reflecting the structure of the task graph.
- How many threads are needed to achieve the maximum theoretical parallelism? <!--
2 threads!
For example, thread 1 handles T1, thread 2 handles T2; then thread 1 does T3, T4, and T5 while thread 2 handle T8, then thread 1 handles T6 and thread 2 finishes T7.
Note: you can always get away with "ceil(parallelism)" threads so long as you assume that a thread can "yield" a task, go do some other work, and resume it later. In this case this was not needed since luckly W(T3)+W(T4)+W(T5) = W(T8).
-->

### **Exercise 8**

Given the following task graph:

&nbsp;

<img align="center" src="https://drive.google.com/uc?id=14BBD6_ctJ-IUH1ubnaF4pMh99LNjX818">

&nbsp;

Assume W(T1)=50, W(T2)=50, W(T3)=200, W(T4)=75, W(T5)=75, W(T6)=100, W(T7)=100, W(T8)=200.

- Which dashed arrow prevents from implementing such a task structure with OpenMP? <!--none, there are still no cycles, though the dependency T6->T8 is redundant with T6->T7->T8-->
- Remove the dashed arrows, calculate work and span. <!--span=600, work=850, parallelism=1.42-->
- Write an OpenMP implementation reflecting the structure of the task graph.

### **Exercise 9**

Given the following task graph:

&nbsp;

<img align="center" src="https://drive.google.com/uc?id=1flOJlQU5-jo4kLeJmBBuou7HqYYYvQZI">

&nbsp;

Assume W(T1)=50, W(T2)=50, W(T3)=50, W(T4)=75, W(T5)=75, W(T6)=100, W(T7)=100, W(T8)=200.

<!--note the redundant dependency T3->T7-->
- Calculate work and span. <!--span=500, work=700, parallelism=1.4-->
- Write an OpenMP implementation reflecting the structure of the task graph.
- How many threads are needed to run the program? <!--ceil(1.4)=2, while T3 and T7 run by themselves in the above path, T4 can be handled by the other thread-->
- How many threads could be active during the execution of Task 3? How many during Task 5? <!--during T3, at most 2 (another on T4), during T5, at most 3 (another on T6 and another on T4)-->

## **Deadlocks 1o1**

A very simple example of how NOT to use barriers.
<br>
Also not how each thread can acquire its own ID. This is written in C++ just to spice things up.

In [14]:
%%writefile /home/OpenMP/deadlock.cpp
#include <iostream>
#include <omp.h>

int main() {
  #pragma omp parallel default(none) shared(std::cout) num_threads(4)
  {
    const int thread_num = omp_get_thread_num();

    if(thread_num == 0) {
      std::cout << "I'm thread 0 and I caused a deadlock!" << std::endl;
    } else {
      #pragma omp barrier
    }

    #pragma omp critical
    std::cout << "I'm thread " << thread_num << std::endl;
  }
}

Writing /home/OpenMP/deadlock.cpp


Compile and run with a timeout (since we know it will deadlock):

In [15]:
!g++ deadlock.cpp -fopenmp -o deadlock
!timeout 4s ./deadlock && echo "Program finished normally." || echo "Program was killed by timeout."

I'm thread 0 and I caused a deadlock!
I'm thread 0
Program was killed by timeout.


The compile often helps you, however. Something this horrific will not even compile:

In [21]:
%%writefile /home/OpenMP/horrific_example.cpp
#include <iostream>
#include <omp.h>

int main() {
#pragma omp parallel
{
#pragma omp single
{
  std::cout << "I've caused a deadlock!\n";
  #pragma omp barrier
}
}
return 0;
}

Overwriting /home/OpenMP/horrific_example.cpp


In [22]:
!g++ horrific_example.cpp -fopenmp -o horror

[01m[Khorrific_example.cpp:[m[K In function ‘[01m[Kint main()[m[K’:
[01m[Khorrific_example.cpp:10:61:[m[K [01;31m[Kerror: [m[Kbarrier region may not be closely nested inside of work-sharing, ‘[01m[Kloop[m[K’, ‘[01m[Kcritical[m[K’, ‘[01m[Kordered[m[K’, ‘[01m[Kmaster[m[K’, explicit ‘[01m[Ktask[m[K’ or ‘[01m[Ktaskloop[m[K’ region
   10 |   #pragma omp barrier // <-- the compiler errors out on this
      |                                                             [01;31m[K^[m[K


## **Loop Schedules**

Let's see the effect of different loops schedules on an unbalanced loop nest:

In [None]:
%%writefile /home/OpenMP/schedules.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 5000 // outer loop iterations
#define M 5000 // inner loop iterations

#define CHUNK 1000 // chunk size
#define THREADS 4  // number of threads

int main(void) {
  omp_set_nested(true); // enable nested parallelism
  omp_set_num_threads(THREADS);

  double start, end;
  double total = 0.0;

  printf("OpenMP Nested Loop Parallelization Comparison\n");
  printf("N = %d, M = %d\nCHUNK = %d, THREADS = %d \n\n", N, M, CHUNK, THREADS);

  // nested work-sharing with static schedule
  total = 0.0;
  start = omp_get_wtime();
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < N; i++) {
    double local_sum = 0.0;
    // Note: we CANNOT put another work-sharing construct here like:
    // #pragma omp for schedule(static)
    // OpenMP forbids two work-sharing constructs back-to-back without a barrier or parallel pragma in-between!
    // The intended way to achieve the same result is the 'collapse' clause, see version 4.
    for (int j = 0; j < M; j++) {
      // fake unbalance: the amount of work depends on i
      for (int k = 0; k < (i % 50 + 1); k++) {
        local_sum += (i * j + k) * 1e-6;
      }
    }
    #pragma omp atomic
    total += local_sum;
  }
  end = omp_get_wtime();
  printf("Version 1 (static, nested for): %f seconds\n", end - start);

  // nested work-sharing with dynamic schedule
  total = 0.0;
  start = omp_get_wtime();
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < N; i++) {
    double local_sum = 0.0;
    for (int j = 0; j < M; j++) {
      for (int k = 0; k < (i % 50 + 1); k++) {
        local_sum += (i * j + k) * 1e-6;
      }
    }
    #pragma omp atomic
    total += local_sum;
  }
  end = omp_get_wtime();
  printf("Version 2 (dynamic, nested for): %f seconds\n", end - start);

  // nested parallelization with dynamic schedule
  total = 0.0;
  start = omp_get_wtime();
  #pragma omp parallel for schedule(static) reduction(+:total)
  for (int i = 0; i < N; i++) {
    // Note: here we can do this, becase we create a nested parallel region
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int j = 0; j < M; j++) {
      for (int k = 0; k < (i % 50 + 1); k++) {
        total += (i * j + k) * 1e-6;
      }
    }
  }
  end = omp_get_wtime();
  printf("Version 3 (dynamic, nested parallel for): %f seconds\n", end - start);

  // collapse clause, single parallel for with dynamic schedule
  total = 0.0;
  start = omp_get_wtime();
  #pragma omp parallel for collapse(2) schedule(dynamic, CHUNK)
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
      for (int k = 0; k < (i % 50 + 1); k++) {
        total += (i * j + k) * 1e-6;
      }
    }
  }
  end = omp_get_wtime();
  printf("Version 4 (dynamic(CHUNK), collapse(2)): %f seconds\n", end - start);

  // collapse clause, single parallel for with guided schedule
  total = 0.0;
  start = omp_get_wtime();
  #pragma omp parallel for collapse(2) schedule(guided)
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
      for (int k = 0; k < (i % 50 + 1); k++) {
        total += (i * j + k) * 1e-6;
      }
    }
  }
  end = omp_get_wtime();
  printf("Version 5 (guided, collapse(2)): %f seconds\n", end - start);
}


Overwriting /home/OpenMP/schedules.cpp


Compile and run:

In [None]:
!g++ schedules.cpp -fopenmp -o schedules
!./schedules

OpenMP Nested Loop Parallelization Comparison
N = 5000, M = 5000
CHUNK = 1000, THREADS = 4 

Version 1 (static, nested for): 1.903383 seconds
Version 2 (dynamic, nested for): 1.865860 seconds
Version 3 (dynamic, nested parallel for): 3.145333 seconds
Version 4 (dynamic(CHUNK), collapse(2)): 4.494570 seconds
Version 5 (guided, collapse(2)): 4.864830 seconds


**Version 1**:
<br>
Only the outer parallel creates a team of threads.

- Low overhead, only one parallel team created.
- Static scheduling ensures predictable, reproducible thread assignment.
- Poor load balancing.

**Version 2**:
<br>
Still a single team of threads.
Dynamic scheduling lets threads grab new chunks as they finish, compensates the unbalaned workload.

- Load balancing.
- More scheduling overhead than static.

**Version 3**:
<br>
The outer parallel for creates one team of threads.
Each outer thread that hits the inner parallel for spawns a new inner team.

- Full control of both loop levels, each loop can scale and be scheduled independently.
- Can exploit more cores if your hardware and OpenMP's runtime support nested teams efficiently.
- High overhead: every outer iteration spawns a new parallel region.
- Memory and scheduling overhead outweigh benefits for small and medium-sized workloads.
- Likely to oversubscribe cores.

**Version 4**:
<br>
The two loops are merged into a single iteration space of size N*M.
One parallel team handles all (i, j) pairs directly.

- More fine grained load balancing when combined with the dynamic schedule.
- Scheduling overhead can be more effectively controlled with larger chunk sizes.
- Only works if the nested loops are perfectly nested and independent.

**Version 5**:
<br>
Same as version 4, but guided scheduling means that when there is a lot of work left to do, each thread is assigned a larger portion of it, with smaller and smaller portions being assigned as less work remains.

- Very fine grained load balancing.
- Extremely reduced scheduling overhead, as we issue more jobs only towards the end, to avoid leaving some threads at idle.

Version 5 is the cleanest and often fastest option.
<br>
*Note: this may not show on Colab, as it gives us only 2 mere cores...*

*Note: in version 1 and 2, a reduction over `local_sum` is not needed, because the inner `#pragma omp for` is not creating a new parallel region, it is simply another work-sharing construct within the same parallel team spawned by the outer pragma.*

Questions:

If you are told that exactly one iteration every 10 needs to do 5x the work, and you have thousands of iterations, what is the best schedule and why?
<!--"static", with batch multiple of 10, large enough to exploit caches, but not too large as not to cause severe unbalance in the number iterations given to each thread-->

Assume now a workload consisting of roughly 2-3x as many iterations as there will be threads (e.g. 25 iterations, 10 threads), you know that each iteration has a one-in-three chance of requiring twice the amount of work as would a normal iteration. What schedule would work best and why?
<!--"dynamic", chunks of 1, atmost 2, iterations, because with this little iterations the scheduling overhead is negligible and using larger chunks or "guided" could not effectively mitigate the unbalance due to the high likelyhood of heavier iterations-->

## **Déjà-vu**

### **Vector Product**

Just another way to see yet another reduce, but now with OpenMP pragmas!

In [None]:
%%writefile /home/OpenMP/vector_vector_prod.cpp
#include <stdio.h>
#include <omp.h>

int main() {
  int N = 1000000;

  double *A = (double*) malloc(N * sizeof(double));
  double *B = (double*) malloc(N * sizeof(double));

  // initialize vectors
  for (int i = 0; i < N; i++) {
    A[i] = i * 0.001;
    B[i] = (N - i) * 0.002;
  }

  double dot = 0.0;
  double start_time = omp_get_wtime();

  // parallel reduce
  #pragma omp parallel for reduction(+ : dot) schedule(static)
  for (int i = 0; i < N; i++) {
    dot += A[i] * B[i];
  }

  double end_time = omp_get_wtime();

  printf("Dot product = %.5f\n", dot);
  printf("Computed in %.5f seconds using %d threads.\n",
  end_time - start_time, omp_get_max_threads());

  free(A);
  free(B);
}


### **MatMul**

Let's just make each thread perform a vector-vector product!
<br>
However, we can try to exploit thread affinity to slightly improve cache performance by manually tiling the loop!
<br>
Say for example that we have a dual-socket motherboard with two 8-core CPUs, with each CPU itself divided in two NUMA nodes (e.g. each group of 4 cores has its own L2 cache and dedicated DRAM channel).
<br>
It's better if we split the work in 16 chunks, space far and wide 4 groups of those chunks, and then pull close the 4 chunks in each group.
This equates to cutting the output matrix's rows and cols in 4 equi-sized groups. This results in 16 chunks. Then we give each of the 4 quadrants of the output matrix to a different NUMA node and each quarter of the quadrant to a thread.

*Note: this is FAR from the best way to do a matmul on CPU! There are several improvements among which "blocking", a similar idea to tiling to better exploit caches!*
<br>
*More on this here:*
- *BLAS: https://www.netlib.org/blas/*
- *BLIS: https://dl.acm.org/doi/10.1145/2764454*
- *More on BLIS: https://dl.acm.org/doi/10.1145/2925987*


In [None]:
%%writefile /home/OpenMP/matmul.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void init_matrix(double *M, int N, double scale) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      M[i * N + j] = scale * ((i + j) % 100);
}

int main(int argc, char *argv[]) {
  int N = 1024;
  if (argc > 1) N = atoi(argv[1]);

  printf("Matrix size: %d x %d\n", N, N);

  double *A = (double*) malloc(N * N * sizeof(double));
  double *B = (double*) malloc(N * N * sizeof(double));
  double *C = (double*) calloc(N * N, sizeof(double));

  init_matrix(A, N, 0.01);
  init_matrix(B, N, 0.02);

  int groups = 4;            // outer level (NUMA groups)
  int threads_per_group = 4; // inner level threads per group
  int tile_size = N / 4;     // 4 x 4 grid of tiles

  omp_set_nested(1); // enable nested parallelism

  double t1 = omp_get_wtime();

  // outer parallel region: spread affinity
  #pragma omp parallel num_threads(groups) proc_bind(spread)
  {
    int gi = omp_get_thread_num(); // group index
    int i_start = gi * tile_size;
    int i_end   = (gi + 1) * tile_size;

    // inner parallel region: close affinity within group
    #pragma omp parallel num_threads(threads_per_group) proc_bind(close)
    {
      int gj = omp_get_thread_num(); // tile index within group
      int j_start = gj * tile_size;
      int j_end = (gj + 1) * tile_size;

      for (int i = i_start; i < i_end; i++) {
        for (int j = j_start; j < j_end; j++) {
          double sum = 0.0;
          for (int k = 0; k < N; k++) {
            sum += A[i * N + k] * B[k * N + j];
          }
          C[i * N + j] = sum;
        }
      }
    }
  }

  double t2 = omp_get_wtime();
  printf("Execution time: %.3f s\n", t2 - t1);

  free(A); free(B); free(C);
}


Writing /home/OpenMP/matmul.cpp


Compile and run:

In [None]:
!g++ matmul.cpp -fopenmp -o matmul
!./matmul

Matrix size: 1024 x 1024
Execution time: 14.621 s


### **Histogram**

Shared memory and atomics, here we go again!
<br>
Remember to rely on privatization!

In [2]:
%%writefile /home/OpenMP/histogram.cpp
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <omp.h>

#define NUM_LETTERS 26
#define BIN_SIZE 4
#define NUM_BINS ((NUM_LETTERS + BIN_SIZE - 1) / BIN_SIZE)

void generate_random_string(char *s, size_t len) {
  for (size_t i = 0; i < len; i++)
    s[i] = 'a' + (rand() % NUM_LETTERS);
  s[len] = '\0';
}

int main(int argc, char *argv[]) {
  size_t N = 10000000;

  char *text = (char*) malloc(N + 1);
  srand(42);
  generate_random_string(text, N);

  // global copy
  int global_hist[NUM_BINS] = {0};

  double t1 = omp_get_wtime();

  #pragma omp parallel
  {
    // private copy
    int local_hist[NUM_BINS] = {0};

    #pragma omp for
    for (size_t i = 0; i < N; i++) {
      char c = text[i];
      if (c >= 'a' && c <= 'z') {
        int bin = (c - 'a') / BIN_SIZE;
        local_hist[bin]++;
      }
    }

    // merge private copies atomically
    for (int b = 0; b < NUM_BINS; b++) {
      #pragma omp atomic
      global_hist[b] += local_hist[b];
    }
  }

  double t2 = omp_get_wtime();
  printf("Execution time: %.4f s using %d threads\n", t2 - t1, omp_get_max_threads());

  // print histogram
  printf("\nHistogram (%u-letter bins):\n", BIN_SIZE);
  for (int b = 0; b < NUM_BINS; b++) {
    char start = 'a' + b * BIN_SIZE;
    char end   = start + BIN_SIZE - 1;
    if (end > 'z') end = 'z';
      printf("  %c-%c : %d\n", start, end, global_hist[b]);
  }

  free(text);
}


Writing /home/OpenMP/histogram.cpp


Compile and run:

In [4]:
!g++ histogram.cpp -fopenmp -o histogram
!./histogram

Execution time: 0.0370 s using 2 threads

Histogram (4-letter bins):
  a-d : 1538630
  e-h : 1538792
  i-l : 1538822
  m-p : 1539186
  q-t : 1536136
  u-x : 1539567
  y-z : 768867


Question:
- what would be a good schedule for the parallel for? <!--static (at most guided) since all iterations are perfectly balanced, we just need to worry about maximizing the cache hits of each thread, so chunks should be reasonably larger while not becoming uneven-->
- is privatization always the best option? <!--no, say that you are counting the occurrencies of 10M items, that are uniformly distributed in the input, then atomic updates on a single shared copy could be fast enough to justify saving the cost of replicating all counters-->

## **Sections and Critical Sections**

There are two very apparent ways to optimize this code:

In [None]:
%%writefile /home/OpenMP/two_loops.cpp
#include <cstdio>
#include <omp.h>
#include <vector>
#include <cmath>

using namespace std;

int main() {
  const int N = 10000000;
  std::vector<double> A(N, 0.0), B(N, 0.0);
  double global_sum_A = 0.0;
  double global_sum_B = 0.0;

  double start = omp_get_wtime();

  #pragma omp parallel default(none) shared(A, B, N, global_sum_A, global_sum_B)
  {
    double local_sum = 0.0;

    // first loop: compute A[i]
    #pragma omp for schedule(static)
    for (int i = 0; i < N; ++i) {
      A[i] = std::sin(i * 0.001);
      local_sum += A[i];
    }

    // conditional accumulation on A
    #pragma omp critical
    {
      if (global_sum_A < 100.0)
        global_sum_A += local_sum;
    }

    // synchronize before the next loop
    #pragma omp barrier
    local_sum = 0.0;

    // second loop: compute B[i]
    #pragma omp for schedule(static)
    for (int i = 0; i < N; ++i) {
      B[i] = std::cos(i * 0.001);
      local_sum += B[i];
    }

    // conditional accumulation on B
    #pragma omp critical
    {
      if (global_sum_B < 100.0)
        global_sum_B += local_sum;
    }
  }

  double end = omp_get_wtime();

  printf("Global sums: A = %.3f, B = %.3f\n", global_sum_A, global_sum_B);
  printf("Total time: %.4f seconds\n", end - start);
}

Overwriting /home/OpenMP/two_loops.cpp


Observe that the two loops (and subsequent critical sections) have no dependency with one-another, they could run concurrently!

What we need to do is:
- run each loop in its own parallel section
- remove the barrier between them
- name the two critical sections to make them independent
- now however we shared work among sections, to also share the work of each loop's iterations, we need a nested parallel region for each loop

In [None]:
%%writefile /home/OpenMP/two_loops.cpp
#include <cstdio>
#include <omp.h>
#include <vector>
#include <cmath>

using namespace std;

int main() {
  omp_set_nested(true);

  const int N = 10000000;
  std::vector<double> A(N, 0.0), B(N, 0.0);
  double global_sum_A = 0.0;
  double global_sum_B = 0.0;

  double start = omp_get_wtime();

  #pragma omp parallel default(none) shared(A, B, N, global_sum_A, global_sum_B)
  {
    #pragma omp sections
    {
      // first loop: compute A[i]
      #pragma omp section
      {
        double local_sum_A = 0.0;
        #pragma omp parallel for reduction(+:local_sum_A) schedule(static)
        for (int i = 0; i < N; ++i) {
          A[i] = std::sin(i * 0.001);
          local_sum_A += A[i];
        }

        #pragma omp critical (acc_A)
        {
          if (global_sum_A < 100.0)
            global_sum_A += local_sum_A;
        }
      }

      // second loop: compute B[i]
      #pragma omp section
      {
        double local_sum_B = 0.0;
        #pragma omp parallel for reduction(+:local_sum_B) schedule(static)
        for (int i = 0; i < N; ++i) {
          B[i] = std::cos(i * 0.001);
          local_sum_B += B[i];
        }

        #pragma omp critical (acc_B)
        {
          if (global_sum_B < 100.0)
            global_sum_B += local_sum_B;
        }
      }
    }
  }

  double end = omp_get_wtime();

  printf("Global sums: A = %.3f, B = %.3f\n", global_sum_A, global_sum_B);
  printf("Total time: %.4f seconds\n", end - start);
}

Overwriting /home/OpenMP/two_loops.cpp


Compile and run:

In [None]:
!g++ two_loops.cpp -fopenmp -o two_loops
!./two_loops

Global sums: A = 1952.308, B = -304.638
Total time: 0.4028 seconds


## **Tasks**

### **MergeSort**

### **Algebraic Expression Evaluation**

Let's evaluate this linar algebra expression with OpenMP tasks:

$A, B, C, D, E \in \mathbb{R}^{4 \times 4}$

$A \cdot (A \cdot B - C \cdot D) - E^2 \cdot D^2$

In [None]:
%%writefile /home/OpenMP/algebra.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 4

void matmul(double A[N][N], double B[N][N], double C[N][N]) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      double s = 0.0;
      for (int k = 0; k < N; k++) s += A[i][k] * B[k][j];
      C[i][j] = s;
    }
}

// alpha = +1 => add, alpha = -1 => sub
void matadd(double A[N][N], double B[N][N], double C[N][N], double alpha) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      C[i][j] = A[i][j] + alpha * B[i][j];
}

void init(double M[N][N], double scale) {
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      M[i][j] = scale * ((i + j) % 7);
}

void printmat(const char *name, double M[N][N]) {
  printf("%s:\n", name);
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++)
      printf("%6.2f ", M[i][j]);
    printf("\n");
  }
}

int main() {
  double A[N][N], B[N][N], C[N][N], D[N][N], E[N][N];
  double AB[N][N], CD[N][N], inner[N][N], Aterm[N][N];
  double E2[N][N], D2[N][N], E2D2[N][N], Result[N][N];

  init(A, 0.5); init(B, 0.7); init(C, 0.9);
  init(D, 1.1); init(E, 1.3);

  double t1 = omp_get_wtime();

  #pragma omp parallel
  #pragma omp single
  {
    // AB = A*B
    #pragma omp task depend(out:AB)
    matmul(A, B, AB);

    // CD = C*D
    #pragma omp task depend(out:CD)
    matmul(C, D, CD);

    // inner = AB - CD
    #pragma omp task depend(in:AB, CD) depend(out:inner)
    matadd(AB, CD, inner, -1.0);

    // Aterm = A * inner
    #pragma omp task depend(in:A, inner) depend(out:Aterm)
    matmul(A, inner, Aterm);

    // E2 = E*E
    #pragma omp task depend(out:E2)
    matmul(E, E, E2);

    // D2 = D*D
    #pragma omp task depend(out:D2)
    matmul(D, D, D2);

    // E2D2 = E2 * D2
    #pragma omp task depend(in:E2, D2) depend(out:E2D2)
    matmul(E2, D2, E2D2);

    // Result = Aterm - E2D2
    #pragma omp task depend(in:Aterm, E2D2) depend(out:Result)
    matadd(Aterm, E2D2, Result, -1.0);

    #pragma omp taskwait
  }

  double t2 = omp_get_wtime();
  printf("Execution time: %.3f ms\n", 1e3 * (t2 - t1));

  printmat("Result", Result);
  return 0;
}


Compile and run:

In [None]:
!g++ algebra.cpp -fopenmp -o algebra
!./algebra

### **Sum of Products**

Assume you are given strings repersenting algebraic expressions in the form of a sums of products with only 2 possible variable, "a" and "b", like:
<br>
"a+a\*b\*b+b\*b\*a\*b\*a+b\*b\*b"
<br>
The following program parses the expression and dynamically spawns an OpenMP task to solve each product, then reducing among those tasks to do the sum.

In [None]:
%%writefile /home/OpenMP/sop.cpp
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

// Evaluate a product term like "a*b*b*a"
float eval_product(const char *term, float a, float b) {
    float res = 1.0f;
    for (const char *p = term; *p; p++) {
        if (*p == 'a') res *= a;
        else if (*p == 'b') res *= b;
    }
    return res;
}

int main() {
    const char *expr = "a*b*b*a+a*a+b*b";
    float a = 2.0f, b = 3.0f;

    printf("Expression: %s\n", expr);
    printf("a = %.2f, b = %.2f\n", a, b);

    // Split the expression into product terms by '+'
    char *expr_copy = strdup(expr);
    char *terms[64];
    int n_terms = 0;
    for (char *tok = strtok(expr_copy, "+"); tok; tok = strtok(NULL, "+"))
        terms[n_terms++] = tok;

    float partials[64] = {0.0f};
    float result = 0.0f;

    double t1 = omp_get_wtime();

    #pragma omp parallel
    #pragma omp single
    {
        // spawn a parsing task that spawns product tasks
        #pragma omp task depend(out:partials)
        {
            for (int i = 0; i < n_terms; i++) {
                int idx = i;
                const char *term = terms[i];
                #pragma omp task firstprivate(idx, term) depend(out:partials[idx])
                {
                    partials[idx] = eval_product(term, a, b);
                    // Optional diagnostic:
                    // printf("Term %d (%s) = %.2f\n", idx, term, partials[idx]);
                }
            }
        }

        // reduction task that depends on all partials
        #pragma omp task depend(in:partials) depend(out:result)
        {
            float sum = 0.0f;
            for (int i = 0; i < n_terms; i++) sum += partials[i];
            result = sum;
        }

        #pragma omp taskwait
    }

    double t2 = omp_get_wtime();

    printf("Result = %.2f\n", result);
    printf("Execution time: %.3f ms\n", 1e3 * (t2 - t1));

    free(expr_copy);
    return 0;
}


Compile and run:

In [None]:
!g++ sop.cpp -fopenmp -o sop
!./sop a*b*b*a+a*a+b*b+a*b*b*a*a*a+b*b*b*b*b+a*b*a*a*b*a+a*a+b*b*a*a*a

### **MatMul with Tasks**

Here we see a slightly new pragma, `taskloop`: splits the iteration space of a loop into OpenMP tasks.
While this looks similar to the `for` worksharing construct, the behavior is fundamentally different.
When using the `for` construct, all threads in the parallel region have to encounter the construct so that they can split up the work, whereas the taskloop construct needs only be executed by a single thread (e.g. the master thread).

*Note: the taskloop construct is defined in a way that is similar to the definition of a regular task.
This means that if N threads encounter the construct, each of the threads will start executing the same loop, so the loop will be executed N times rather than having a single incarnation of the loop split between them.*

The `grainsize` clause defines how many iterations should be executed per task.
Funnily enough, the OpenMP standard allows this to be an interval, which in this example will be eight to sixteen iterations, so an implementation has some flexibility in choosing the exact number of iterations (and, therefore, tasks).
The construct also supports the `num_tasks` clause, which specifies exactly how many tasks should be created, and then adjusts the chunk size accordingly.

In [None]:
%%writefile /home/OpenMP/tmatmul.cpp
#include <cstdio>
#include <cstdlib>
#include <omp.h>

void matmul_taskloop(float *C, const float *A, const float *B, size_t n) {
  #pragma omp parallel firstprivate(n)
  {
    #pragma omp master
    {
      #pragma omp taskloop firstprivate(n) grainsize(8)
      for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
          for (int j = 0; j < n; ++j)
            C[i * n + j] += A[i * n + k] * B[k * n + j];
    }
  }
}

void init_matrix(float *M, int n, float scale) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      M[i*n + j] = scale * float((i + j) % 13);
}

int main() {
  int n = 512;

  printf("Matrix size: %d x %d\n", n, n);

  float *A = (float*) aligned_alloc(64, n * n * sizeof(float));
  float *B = (float*) aligned_alloc(64, n * n * sizeof(float));
  float *C = (float*) aligned_alloc(64, n * n * sizeof(float));

  init_matrix(A, n, 0.01f);
  init_matrix(B, n, 0.02f);
  init_matrix(C, n, 0.0f);

  double t1 = omp_get_wtime();
  matmul_taskloop(C, A, B, n);
  double t2 = omp_get_wtime();

  printf("Time: %.3f sec\n", t2 - t1);

  free(A);
  free(B);
  free(C);
}


Overwriting /home/OpenMP/tmatmul.cpp


Compile and run:

In [None]:
!g++ tmatmul.cpp -fopenmp -o tmatmul
!./tmatmul

Matrix size: 512 x 512
Time: 0.707 sec


## **Inspect The Hardware**

See what the machine we are using (here on Colab) is capable of:

In [12]:
%%writefile /home/OpenMP/inspect_hw.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <omp.h>
#include <unistd.h>
#include <thread>

// helper to read cache info from Linux sysfs
long read_cache_size(int level) {
    std::string path = "/sys/devices/system/cpu/cpu0/cache/index" + std::to_string(level) + "/size";
    std::ifstream file(path);
    if (!file.is_open()) return -1;
    std::string value;
    file >> value;
    long size = std::stol(value);
    if (value.find('K') != std::string::npos) size *= 1024;
    if (value.find('M') != std::string::npos) size *= 1024 * 1024;
    return size;
}

int main() {
  // CPU and threading info
  int omp_procs = omp_get_num_procs();
  int omp_max_threads = omp_get_max_threads();
  unsigned int hw_threads = std::thread::hardware_concurrency();

  std::cout << "=== System Info (OpenMP + Hardware) ===\n";
  std::cout << "Logical processors available (OpenMP): " << omp_procs << "\n";
  std::cout << "Max OpenMP threads: " << omp_max_threads << "\n";
  std::cout << "Hardware concurrency (std::thread): " << hw_threads << "\n";

  // hyperthreading available if hardware threads > physical cores
  if (hw_threads > omp_procs / 2)
    std::cout << "Hyperthreading likely enabled.\n";
  else
    std::cout << "No hyperthreading detected (or not applicable).\n";

  // cache info
  for (int i = 0; i < 3; ++i) {
    long size = read_cache_size(i);
    if (size > 0)
      std::cout << "L" << (i + 1) << " cache size: " << size / 1024 << " KB\n";
  }

  // RAM info
  long pages = sysconf(_SC_PHYS_PAGES);
  long page_size = sysconf(_SC_PAGE_SIZE);
  double total_ram_gb = (double)pages * page_size / (1024.0 * 1024.0 * 1024.0);
  std::cout << "Total physical memory: " << total_ram_gb << " GB\n";

  // OpenMP runtime confirmation
  #pragma omp parallel
  {
    #pragma omp single
    std::cout << "Actual threads used by default by OpenMP: " << omp_get_num_threads() << "\n";
  }

  return 0;
}


Writing /home/OpenMP/inspect_hw.cpp


Compile and run:

In [13]:
!g++ inspect_hw.cpp -fopenmp -o inspect_hw
!./inspect_hw

=== System Info (OpenMP + Hardware) ===
Logical processors available (OpenMP): 2
Max OpenMP threads: 2
Hardware concurrency (std::thread): 2
Hyperthreading likely enabled.
L1 cache size: 32 KB
L2 cache size: 32 KB
L3 cache size: 256 KB
Total physical memory: 12.6714 GB
Actual threads used by default by OpenMP: 2
