# Tasking

## Overview

A task in OpenMP is a unit of work that includes code to execute and its associated data environment.
Tasks can be created dynamically during the execution of a program.
Each task is executed independently and can be scheduled on any available thread of the current team.

Let's start with a simple, still serial example.
First we load our custom magic.

In [None]:
%load_ext ice.magic

In [None]:
%%cpp_omp -o code/tasking/serial.cpp -t

int a = 1, b = 2, c = 3;

std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;

## Task Creation

Tasks are created using the `task` directive.
When a thread encounters a `task` directive, it packages the associated code and data into a task, which is either executed immediately or deferred, i.e. placed in a pool of tasks waiting for execution.

Created tasks are executed by the threads in the current team, but not necessarily by the thread that created them.
The OpenMP runtime system schedules tasks to threads dynamically based on availability, enabling better load balancing for irregular workloads.

In [None]:
%%cpp_omp -o code/tasking/task.cpp -t

int a = 1, b = 2, c = 3;

#pragma omp parallel  //# create a parallel region to spawn threads
#pragma omp single    //# in this example, one thread creates all tasks,
{                     //# but multiple threads may execute them
    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
        std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;
    }

    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
        std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;
    }

    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
        std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;
    }
}

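Data-sharing rules are as already discussed, with one exception: variables that are private in the enclosing constructs are `firstprivate` in the task by default, i.e. the task receives its own copy, initialized with the value the variable has when the task is created.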

In [None]:
%%cpp_omp -o code/tasking/firstprivate.cpp -t

#pragma omp parallel
#pragma omp single
{
    int privateVar = 5;
    #pragma omp task // implicit firstprivate(privateVar)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * privateVar));
        std::cout << "thread " << omp_get_thread_num() << ": privateVar = " << privateVar << std::endl;
        privateVar += 10; // modifies only the task's own copy
    }

    std::cout << "thread " << omp_get_thread_num() << ": privateVar = " << privateVar << std::endl;
}
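
To contrast the two cases, the following minimal sketch modifies a shared and a private variable inside the same task. The function name `run_data_sharing_demo` and the variable names are made up for illustration, and the described behavior assumes compilation with OpenMP enabled (for example `-fopenmp`).

```cpp
#include <utility>

// Returns {sharedVar, privateVar} as seen by the creating thread
// after the task has finished.
std::pair<int, int> run_data_sharing_demo() {
    int sharedVar = 0;        // declared outside the parallel region -> shared
    int observedPrivate = 0;  // used to carry privateVar out of the region

    #pragma omp parallel
    #pragma omp single
    {
        int privateVar = 5;   // declared inside the region -> private

        // sharedVar stays shared, privateVar is implicitly firstprivate
        #pragma omp task
        {
            sharedVar += 10;   // modifies the original variable
            privateVar += 10;  // modifies only the task's own copy
        }

        #pragma omp taskwait   // the task must be done before the values are read
        observedPrivate = privateVar;
    }

    return {sharedVar, observedPrivate};
}
```

With OpenMP enabled, the update of `sharedVar` is visible to the creating thread, while its `privateVar` keeps the value 5.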

## Task Synchronization

The `taskwait` directive makes the current task wait until all of its *child tasks*, i.e. the tasks it has created directly, are completed.
Descendants of those child tasks (the recursive task in the example below) are not included.

In [None]:
%%cpp_omp -o code/tasking/taskwait.cpp -t

int a = 1, b = 2, c = 3;

#pragma omp parallel
#pragma omp single
{
    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
        std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;
    }

    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
        std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;
    }

    #pragma omp task
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
        std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;

        #pragma omp task
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
            std::cout << "thread " << omp_get_thread_num() << ": recursive task c = " << c << std::endl;
        }
    }

    #pragma omp taskwait

    #pragma omp task
    std::cout << "thread " << omp_get_thread_num() << ": all tasks finished" << std::endl;
}
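
A common pattern that relies on `taskwait` is recursive divide and conquer. The sketch below computes Fibonacci numbers this way; the function names `fib` and `fib_parallel` are illustrative, and the naive recursion is chosen for brevity, not for performance.

```cpp
#include <cassert>

// Recursive Fibonacci: each call spawns two child tasks and waits for them.
long fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;

    long x = 0, y = 0;

    // x and y are local variables, so they must be shared explicitly;
    // otherwise each task would write to its own firstprivate copy.
    #pragma omp task shared(x)
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    // Waits for the two child tasks of *this* task only. Their descendants
    // are covered because every child performs its own taskwait.
    #pragma omp taskwait
    return x + y;
}

// One thread starts the recursion, the whole team executes the tasks.
long fib_parallel(int n) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

Calling `fib_parallel(10)` yields 55. The `taskwait` is also what makes `shared(x)` and `shared(y)` safe here, since `x` and `y` must stay alive until both children have finished.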

## Taskgroups

Taskgroups group tasks together so that they can be waited on as a group rather than individually.
The `taskgroup` directive marks a structured block and includes an implicit synchronization at its end.
In contrast to `taskwait`, this synchronization also covers all descendant tasks, such as the recursive task in the example below.

In [None]:
%%cpp_omp -o code/tasking/taskgroup.cpp -t

int a = 1, b = 2, c = 3;

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskgroup
    {
        #pragma omp task
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
            std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;
        }

        #pragma omp task
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
            std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;
        }

        #pragma omp task
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
            std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;

            #pragma omp task
            {
                std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
                std::cout << "thread " << omp_get_thread_num() << ": recursive task c = " << c << std::endl;
            }
        }
    } //# implicit synchronization - wait for all tasks from this taskgroup to finish

    #pragma omp task
    std::cout << "thread " << omp_get_thread_num() << ": taskgroup finished" << std::endl;
}
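
The difference to `taskwait` matters most when tasks spawn further tasks without waiting for them. As a minimal sketch, the hypothetical functions `visit` and `count_nodes` below count the nodes of a full binary tree, with one task per node and a single `taskgroup` at the top.

```cpp
#include <cassert>

// Visits one node of an implicit binary tree and spawns one task per child.
void visit(int depth, int* counter) {
    #pragma omp atomic
    *counter += 1;

    if (depth == 0) return;

    for (int child = 0; child < 2; ++child) {
        // depth and the pointer are copied into the task (firstprivate)
        #pragma omp task
        visit(depth - 1, counter);
    }
    // no taskwait: the children may still be running when this task ends
}

// Counts the nodes of a full binary tree of the given depth.
int count_nodes(int depth) {
    assert(depth >= 0);
    int counter = 0;
    int result = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp taskgroup
        {
            #pragma omp task
            visit(depth, &counter);
        } // waits for the task above and all of its descendants

        // counter is complete here; a taskwait would only cover the root task
        result = counter;
    }
    return result;
}
```

For a tree of depth 3, `count_nodes(3)` returns 15.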

## Task Dependencies

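Tasks can declare dependencies on other tasks using the `depend` clause.
A task with an `in` dependency on a variable only starts after all previously created sibling tasks with an `out` or `inout` dependency on the same variable have completed; a task with an `out` or `inout` dependency additionally waits for previously created sibling tasks with an `in` dependency on it.

If no previously created task lists a matching dependency for a variable, the dependency is considered fulfilled. Hence, the order in which tasks are created matters.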

<div class="alert alert-block alert-info"> <b>Note:</b> Tasks don't have to use the variables they depend on. </div>

In [None]:
%%cpp_omp -o code/tasking/depend.cpp -t

int a = 1, b = 2, c = 3;

#pragma omp parallel  //# create a parallel region to spawn threads
#pragma omp single    //# in this example, one thread creates all tasks,
{                     //# but multiple threads may execute them
    #pragma omp task depend(out : b)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
        std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;
        b = a + 10;
    }

    #pragma omp task depend(in : b) depend(out : c)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
        std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;
        c = b + 10;
    }

    #pragma omp task depend(in : c)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
        std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;
    }
}
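
Dependencies can also be expressed on array elements, which is convenient when tasks are created in a loop. The following sketch (the function name `running_sum` is made up) builds a chain in which each task reads the element written by its predecessor.

```cpp
#include <cassert>
#include <vector>

// Builds the sequence v[0] = 0, v[i] = v[i-1] + i with one task per element.
std::vector<int> running_sum(int n) {
    assert(n >= 0);
    std::vector<int> v(n, 0);
    int* data = v.data();

    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 1; i < n; ++i) {
            // reads data[i-1] and writes data[i], so the tasks form a chain;
            // i is firstprivate, data is shared
            #pragma omp task depend(in : data[i - 1]) depend(out : data[i])
            data[i] = data[i - 1] + i;
        }
    } // all tasks are finished at the end of the region

    return v;
}
```

For `n = 5` the result is `0, 1, 3, 6, 10`. The first task finds no earlier task with a matching `out` dependency on `data[0]`, so it can start right away.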

## Yielding Tasks & Untied Tasks

The `taskyield` directive introduces an explicit task scheduling point (TSP).
It allows the executing thread to suspend the current task and work on other tasks first.
This can be useful for long-running tasks, as it keeps a single task from monopolizing a thread.

Implicit TSPs occur, among others, at `task` constructs, at `taskwait`, at the end of a `taskgroup`, and at barriers.

By default, tasks in OpenMP are *tied*, that is, once a thread begins executing a task, the same thread must complete it.
Tasks marked as `untied` can be suspended at a TSP and later be resumed by a different thread.

In [None]:
%%cpp_omp -o code/tasking/taskyield.cpp -t

#pragma omp parallel num_threads(4)
#pragma omp single
{
    for (int t = 0; t < 10; ++t)
    {
        #pragma omp task untied
        {
            auto start = omp_get_wtime();

            #pragma omp critical
                std::cout << "thread " << omp_get_thread_num() << " started work on task " << t << std::endl;

            for (int i = 0; i < 10; ++i) {
                #pragma omp critical
                    std::cout << "thread " << omp_get_thread_num() << " works on task " << t << std::endl;
                std::this_thread::sleep_for(std::chrono::milliseconds(100));
                #pragma omp taskyield
            }

            auto end = omp_get_wtime();

            #pragma omp critical
                std::cout << "thread " << omp_get_thread_num() << " finished work on task " << t << " after " << 1e3 * (end - start) << " ms" << std::endl;
        }
    }
}

## Conditional Tasks

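Adding an `if` clause controls whether a task is *deferred* (the expression evaluates to true), i.e. the task may be executed later and by another thread, or *undeferred* (false), i.e. the encountering thread suspends its current task and the new task is executed immediately.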

In [None]:
%%cpp_omp -o code/tasking/if.cpp -t

int a = 1, b = 2, c = 3;
bool deferred = false;

#pragma omp parallel
#pragma omp single
{
    #pragma omp task if(deferred)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * a));
        std::cout << "thread " << omp_get_thread_num() << ": a = " << a << std::endl;
    }

    #pragma omp task if(deferred)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * b));
        std::cout << "thread " << omp_get_thread_num() << ": b = " << b << std::endl;
    }

    #pragma omp task if(deferred)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * c));
        std::cout << "thread " << omp_get_thread_num() << ": c = " << c << std::endl;
    }
}

A similar mechanism is provided by the `final` clause.
When its expression evaluates to true, the generated task becomes a *final* task, and all tasks created inside it are final and *included*, i.e. they are executed immediately by the encountering thread.
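
A typical use of `final` is a cutoff in recursive algorithms, so that small subproblems do not pay the overhead of deferred tasks. The sketch below assumes hypothetical function names and an arbitrary `cutoff` parameter; a suitable value depends on the workload.

```cpp
#include <cassert>

// Recursive Fibonacci with a task cutoff.
long fib_cutoff(int n, int cutoff) {
    assert(n >= 0);
    if (n < 2) return n;

    long x = 0, y = 0;

    // For n <= cutoff the generated tasks are final: every task created
    // inside them is executed immediately by the encountering thread.
    #pragma omp task shared(x) final(n <= cutoff)
    x = fib_cutoff(n - 1, cutoff);

    #pragma omp task shared(y) final(n <= cutoff)
    y = fib_cutoff(n - 2, cutoff);

    #pragma omp taskwait
    return x + y;
}

// One thread starts the recursion, the whole team executes the tasks.
long fib_with_cutoff(int n, int cutoff) {
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib_cutoff(n, cutoff);
    return result;
}
```

The result does not depend on the cutoff, for example `fib_with_cutoff(15, 5)` yields 610. Replacing `final(n <= cutoff)` with `if(n > cutoff)` gives a related variant in which each small task is undeferred individually.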

## Additional Material

* Priority
    * Tasks can be given a priority with the `priority` clause
    * This is a hint for the run-time system
    * Behavior must be enabled by setting the environment variable `OMP_MAX_TASK_PRIORITY`, since priorities are capped at this value and it defaults to 0

* Task loops
    * The `taskloop` directive splits the iterations of a loop into tasks
    * Can be nested arbitrarily
    * Automatic load balancing
    * Allows various task- and loop-related clauses

In [None]:
%%cpp_omp -o code/tasking/taskloop.cpp -t

#pragma omp parallel num_threads(2)
#pragma omp single
{
    #pragma omp taskloop
    for (int i = 0; i < 5; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10 * i));
        std::cout << "thread " << omp_get_thread_num() << ": i = " << i << std::endl;
    }
}
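
One of the loop-related clauses is `grainsize`, which controls how many iterations end up in each generated task. A minimal sketch, with the function name `scale` and the grain size of 4 chosen arbitrarily:

```cpp
#include <vector>

// Multiplies every element of the input by factor, using a taskloop.
std::vector<double> scale(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    const int n = static_cast<int>(in.size());
    const double* src = in.data();
    double* dst = out.data();

    #pragma omp parallel
    #pragma omp single
    {
        // each generated task handles at least 4 iterations
        #pragma omp taskloop grainsize(4)
        for (int i = 0; i < n; ++i)
            dst[i] = factor * src[i];
        // taskloop waits for its tasks here unless nogroup is specified
    }

    return out;
}
```

Larger grain sizes reduce tasking overhead, smaller ones leave more room for load balancing.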


* Task reductions (available since OpenMP 5.0)
    * Reductions that span nested tasks are supported via the `task_reduction` and `in_reduction` clauses

In [None]:
%%cpp_omp -o code/tasking/reduction.cpp -t

int a = 1, b = 5, c = 10;
int sum = 0;

#pragma omp parallel
#pragma omp single
{
    #pragma omp taskgroup task_reduction(+ : sum)
    {
        #pragma omp task in_reduction(+ : sum)
            sum += a;

        #pragma omp task in_reduction(+ : sum)
            sum += b;

        #pragma omp task in_reduction(+ : sum)
        {
            sum += c;
            #pragma omp task in_reduction(+ : sum)
                sum += c;
        }
    } //# end of taskgroup - sum available

    #pragma omp task
        std::cout << "thread " << omp_get_thread_num() << ": sum = " << sum << std::endl;
}
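
Reductions can also be combined with task loops through the `reduction` clause of `taskloop`. The following sketch assumes a compiler with OpenMP 5.0 support; the function name `sum_up_to` is made up for illustration.

```cpp
// Sums the integers 1..n with a taskloop reduction.
long sum_up_to(int n) {
    long sum = 0;

    #pragma omp parallel
    #pragma omp single
    {
        // every generated task accumulates into a private copy of sum;
        // the copies are combined when the taskloop finishes
        #pragma omp taskloop reduction(+ : sum)
        for (int i = 1; i <= n; ++i)
            sum += i;
    }

    return sum;
}
```

For example, `sum_up_to(100)` returns 5050.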

## Exercise: Tasked with Cooking

<div class="alert alert-block alert-success"> <b>Exercise:</b> Apply tasking to a mock simulation. </div>

The code example below simulates the process of preparing a classic meal: burger and fries.
While this might not seem particularly relevant for scientific applications, it still represents a multi-step process with non-trivial dependencies.
In practice, the individual steps would most likely be somewhat more technical, such as computing eigenvalues of sub-matrices or computing an error norm.


Your task is to parallelize the example at hand with tasks while respecting these restrictions:
* Everything involving a knife (raw fries, tomatoes, salad) needs to be serialized since there is only one knife.
* Everything involving an oven (fries, bun) can be done in parallel, since it is a big oven.
* The raw fries can only be put into the oven once they are completely cut.
* Frying can be done in parallel with everything else (excluding assembly).
* The final assembly can only be started when all components are ready.

In [None]:
%%cpp_omp -o code/tasking/cooking.cpp

auto start = omp_get_wtime();

std::this_thread::sleep_for(std::chrono::milliseconds(300));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Cutting raw fries done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(200));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Cutting tomatoes done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(200));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Preparing salad done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(500));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Baking fries done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(300));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Baking bun done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(300));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Frying patty done" << std::endl;

std::this_thread::sleep_for(std::chrono::milliseconds(200));
std::cout << 1e3 * (omp_get_wtime() - start) << ": Assembly done" << std::endl;

std::cout << "Meal completed after " << 1e3 * (omp_get_wtime() - start) << " ms" << std::endl;

### Solution

In [None]:
%%cpp_omp -o code/tasking/cooking-solution.cpp

auto start = omp_get_wtime();

int rawFries, fries, tomatoes, salad, patty, bun; // dummy variables, used only as dependency handles

#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(out : rawFries)
    #pragma omp critical (knife)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Cutting raw fries done" << std::endl;
    }

    #pragma omp task depend(out : tomatoes)
    #pragma omp critical (knife)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Cutting tomatoes done" << std::endl;
    }

    #pragma omp task depend(out : salad)
    #pragma omp critical (knife)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Preparing salad done" << std::endl;
    }

    #pragma omp task depend(in : rawFries) depend(out : fries)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Baking fries done" << std::endl;
    }

    #pragma omp task depend(out : bun)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Baking bun done" << std::endl;
    }

    #pragma omp task depend(out : patty)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(300));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Frying patty done" << std::endl;
    }

    #pragma omp task depend(in : fries, bun, patty, tomatoes, salad)
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        std::cout << 1e3 * (omp_get_wtime() - start) << ": Assembly done" << std::endl;
    }
}

std::cout << "Meal completed after " << 1e3 * (omp_get_wtime() - start) << " ms" << std::endl;
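
As a possible alternative to the `critical` regions, the knife can be modeled with the `mutexinoutset` dependence type, which is available since OpenMP 5.0. The sketch below is a variation under that assumption, not the reference solution: it omits the sleeps, records finished steps in a vector instead of printing them, and uses the made-up names `record`, `cook` and `knife`.

```cpp
#include <string>
#include <vector>

// Appends a finished step; the critical region protects the vector.
void record(std::vector<std::string>& steps, const char* step) {
    #pragma omp critical (cooking_log)
    steps.push_back(step);
}

// Runs the cooking steps as tasks and returns them in completion order.
std::vector<std::string> cook() {
    std::vector<std::string> steps;
    // dummy variables, used only as dependency handles
    int knife = 0, rawFries = 0, fries = 0, tomatoes = 0, salad = 0, patty = 0, bun = 0;

    #pragma omp parallel
    #pragma omp single
    {
        // mutexinoutset: the three knife tasks exclude each other, in any order
        #pragma omp task depend(out : rawFries) depend(mutexinoutset : knife)
        record(steps, "cut fries");

        #pragma omp task depend(out : tomatoes) depend(mutexinoutset : knife)
        record(steps, "cut tomatoes");

        #pragma omp task depend(out : salad) depend(mutexinoutset : knife)
        record(steps, "prepare salad");

        #pragma omp task depend(in : rawFries) depend(out : fries)
        record(steps, "bake fries");

        #pragma omp task depend(out : bun)
        record(steps, "bake bun");

        #pragma omp task depend(out : patty)
        record(steps, "fry patty");

        #pragma omp task depend(in : fries, bun, patty, tomatoes, salad)
        record(steps, "assemble");
    }

    return steps;
}
```

With this approach a task waiting for the knife is not started at all, whereas a task blocked in a `critical` region already occupies a thread.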