# Workflow Management and Meta Job Scheduler

This tutorial illustrates how computations can be executed on large 
(or small) compute servers, also known as
high-performance computers (HPC), compute clusters, or supercomputers.

This means that the computation is not executed on the local workstation 
(or laptop) but on some other computer.
This approach is particularly handy for large computations which run
for multiple hours or days, since a user can, 
e.g., shut down or restart their personal computer without killing the compute job. 

*BoSSS* features a set of classes and routines 
(an API, application programming interface) for communication with 
compute clusters. This is especially handy for **scripting**,
e.g. for parameter studies, where dozens of computations 
have to be started and monitored.

First, we initialize the new worksheet.
Note: 
1. This tutorial can be found in the source code repository as `MetaJobManager.ipynb`. 
   One can load it directly into Jupyter to work interactively with the following code examples.
2. **In the following line, the reference to `BoSSSpad.dll` is required**. 
   You must either set `#r "BoSSSpad.dll"` to something which is appropriate for your computer
   (e.g. `C:\Program Files (x86)\FDY\BoSSS\bin\Release\net5.0\BoSSSpad.dll` if you installed the binary distribution),
   or, if you are working with the source code, you must compile `BoSSSpad` and put it side-by-side with this worksheet file
   (from the original location in the repository, you can use the scripts `getbossspad.sh` or `getbossspad.bat`).

In [None]:
#r "BoSSSpad.dll"
//#r "../../../src/L4-application/BoSSSpad/bin/Debug/net6.0/BoSSSpad.dll"
using System;
using System.Collections.Generic;
using System.Linq;
using ilPSP;
using ilPSP.Utils;
using BoSSS.Platform;
using BoSSS.Foundation;
using BoSSS.Foundation.Grid;
using BoSSS.Foundation.Grid.Classic;
using BoSSS.Foundation.IO;
using BoSSS.Solution;
using BoSSS.Solution.Control;
using BoSSS.Solution.GridImport;
using BoSSS.Solution.Statistic;
using BoSSS.Solution.Utils;
using BoSSS.Solution.Gnuplot;
using BoSSS.Application.BoSSSpad;
using BoSSS.Application.XNSE_Solver;
using static BoSSS.Application.BoSSSpad.BoSSSshell;
Init();


## Batch Processing

First, we have to select a *batch system* (aka. *execution queue*, aka. *queue*) that we want to use.
Batch systems are a common approach to organizing workloads (aka. compute jobs)
on compute clusters.
On such systems, a user typically does **not** start a simulation manually/interactively.
Instead, the user specifies a so-called *compute job*. The *scheduler* 
(i.e. the batch system) collects 
compute jobs from all users on the compute cluster, sorts them according to 
some priority and puts the jobs into a queue, also called a *batch*.
The jobs in the batch are then executed in order, depending on the 
available hardware and the scheduling policies of the system.
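
The scheduling idea described above can be illustrated with a toy example. This is only a minimal sketch with made-up job names and priorities; it does not use the *BoSSS* API, and real batch systems such as Slurm apply far more elaborate policies (fair share, resource limits, backfilling):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical compute jobs submitted by different users: (name, priority).
// In this sketch, a higher number means a higher priority.
var submitted = new List<(string Name, int Priority)> {
    ("alice-parameter-study", 1),
    ("bob-short-test",        5),
    ("carol-production-run",  3),
};

// The scheduler sorts the jobs by priority. OrderByDescending is a stable sort,
// so jobs with equal priority keep their submission order.
var batch = new Queue<(string Name, int Priority)>(
    submitted.OrderByDescending(job => job.Priority));

// Jobs leave the queue one at a time, in the order decided by the scheduler.
var executionOrder = new List<string>();
while (batch.Count > 0) {
    var job = batch.Dequeue();
    executionOrder.Add(job.Name);
    Console.WriteLine($"running {job.Name} (priority {job.Priority})");
}
```

The point of the sketch is that the user only hands over a job description; when and in which order the job runs is decided by the batch system.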

The *BoSSS* API provides front-ends (clients) for the following 
batch system software:

- `BoSSS.Application.BoSSSpad.SlurmClient` for the 
  Slurm Workload Manager (very prominent on Linux HPC systems)
- `BoSSS.Application.BoSSSpad.MsHPC2012Client`
  for the Microsoft HPC Pack 2012 and higher
- `BoSSS.Application.BoSSSpad.MiniBatchProcessorClient` for the 
  mini batch processor, a minimalistic, *BoSSS*-internal batch system which mimics 
  a supercomputer batch system on the local machine.


The list of clients for various batch systems, which is loaded by the 
`Init()` command, can be configured through the file 
`~/.BoSSS/etc/BatchProcessorConfig.json`.
If this file is missing, a default setting, containing a 
mini batch processor, is initialized. 

The list of all execution queues can be accessed through:

In [None]:
ExecutionQueues

In order to run a simulation job, one can either manually select one of these queues or simply use the **default queue**. The default queue for execution can be configured in two ways:
- globally, it can be specified by the `DefaultQueueIndex` in the configuration file `~/.BoSSS/etc/BatchProcessorConfig.json`
- for each project (see below), it can be overridden using the file `~/.BoSSS/etc/DefaultQueuesProjectOverride.txt`

### Note on the Mini Batch Processor:
The batch processor for local jobs can be started separately (by launching
`MiniBatchProcessor.exe` or `dotnet MiniBatchProcessor.dll`), **which is the preferred option**.
Alternatively, it can be started from the Jupyter notebook; whether 
`MiniBatchProcessor.exe` is terminated together with the notebook kernel depends on the operating system.
If no mini batch processor is running, one is started (hopefully) upon job activation.


## Initializing the workflow management

In order to use the workflow management, 
the very first thing we have to do is to initialize it by defining 
a **project name**; here it is `MetaJobManager_Tutorial`. 
This name is used to generate names for the compute jobs and to 
identify sessions in the database:

In [None]:
BoSSSshell.WorkflowMgm.Init("MetaJobManager_Tutorial");

For this project, the default execution queue is set to:

In [None]:
GetDefaultQueue()

We verify that we have no jobs defined so far ...

In [None]:
BoSSSshell.WorkflowMgm.AllJobs

In [None]:
// the following line is part of the test system and not necessary in user worksheets:
NUnit.Framework.Assert.IsTrue(BoSSSshell.WorkflowMgm.AllJobs.Count == 0, "MetaJobManager tutorial: expecting 0 jobs on entry.");

The initialization of the Workflow Management environment already creates (or opens)
a *BoSSS database* with the same name as the project. The current default database is set as:

In [None]:
wmg.DefaultDatabase

In [None]:
// From previous versions of the code, not required anymore:
//var myLocalDb = myBatch.CreateTempDatabase();

### Notes on databases:
* For expensive simulations, which run for days or longer, temporary databases,
  which are created with a different name every time the notebook is executed, 
  are typically **not** desired.
  Instead, one wants the compute jobs to persist: if the worksheet is re-executed (maybe on another day) 
  and the computation 
  has already been successful at some point in the past, the result is recovered from the database.
  In such a scenario, one cannot use a temporary database; the default project database
  should be used instead.
* Under certain circumstances, one could also create additional databases.
  Using the methods `BatchProcessorClient.CreateTempDatabase()` 
  or `BatchProcessorClient.CreateOrOpenCompatibleDatabase(...)`
  ensures that the database is in a directory which can be accessed by the batch system.
  Alternative functions, i.e. `BoSSSshell.CreateTempDatabase()` or `BoSSSshell.OpenOrCreateDatabase(...)`, 
  do not guarantee this, and the user has to ensure an appropriate location.

All currently opened databases can be listed using:

In [None]:
databases

## Loading a BoSSS-Solver and Setting up a Simulation

As an example, we use the workflow management tools to simulate 
incompressible channel flow. For this, we have to import the solver namespace
and repeat the steps from the IBM example (Tutorial 2) in order to set up the
control object:

In [None]:
using BoSSS.Application.XNSE_Solver;

We create a grid with boundary conditions:

In [None]:
var xNodes       = GenericBlas.Linspace(0, 10 , 41); 
var yNodes       = GenericBlas.Linspace(-1, 1, 9); 
GridCommons grid = Grid2D.Cartesian2DGrid(xNodes, yNodes);
grid.DefineEdgeTags(delegate (double[] X) { 
    double x = X[0];
    double y = X[1]; 
    if (Math.Abs(y - (-1)) <= 1.0e-8) 
        return "wall"; // lower wall
    if (Math.Abs(y - (+1)) <= 1.0e-8) 
        return "wall"; // upper wall
    if (Math.Abs(x - (0.0)) <= 1.0e-8) 
        return "Velocity_Inlet"; // inlet
    if (Math.Abs(x - (+10.0)) <= 1.0e-8) 
        return "Pressure_Outlet"; // outlet
    throw new ArgumentOutOfRangeException("unknown domain"); 
});

One can save this grid explicitly to a database, but this is not required:
the grid should be saved automatically when the job is activated.

In [None]:
//wmg.DefaultDatabase.SaveGrid(ref grid);

Next, we create the control object for the incompressible simulation:

In [None]:
var c = new XNSE_Control();
// general description:
int k                = 1;
string desc          = "Steady state, channel, k" + k; 
c.SessionName        = "SteadyStateChannel"; 
c.ProjectDescription = desc;
c.savetodb           = true;
c.Tags.Add("k" + k);
// setting the grid:
c.SetGrid(grid);
// DG polynomial degree
c.SetDGdegree(k);
// Physical parameters:
double reynolds            = 20; 
c.PhysicalParameters.rho_A = 1; 
c.PhysicalParameters.mu_A  = 1.0/reynolds;
// Timestepping properties:
c.TimesteppingMode = AppControl._TimesteppingMode.Steady;

The specification of boundary conditions and initial values
is a bit more complicated if the job manager is used:

Since the solver is executed in an external program, the control object 
has to be saved in a file. For many complicated objects,
especially for delegates, C# does not support serialization 
(converting the object into a form that can be saved on disk or 
transmitted over a network), so a workaround is needed.
This is achieved, e.g., by the **Formula** object, where a C# formula
is saved as a string.

In [None]:
var WallVelocity = new Formula("X => 0.0", false); // 2nd argument = false says that it is a time-independent formula.

Testing the formula:

In [None]:
WallVelocity.Evaluate(new[]{0.0, 0.0}, 0.0) // evaluating at (0,0), at time 0

In [None]:
// [Deprecated]
/// A disadvantage of string formulas is that they look a bit "alien"
/// within the worksheet; therefore, there is also a little hack which allows 
/// the conversion of a static member function of a static class into a 
/// `Formula` object:

In [None]:
// Deprecated, this option is no longer supported in .NET5
static class StaticFormulas {
    public static double VelX_Inlet(double[] X) {
        //double x  = X[0];
        double y  = X[1]; // y is the second coordinate; the walls are at y = -1 and y = +1
        double UX = 1.0 - y*y;
        return UX;
    }  
 
    public static double VelY_Inlet(double[] X) {
        return 0.0;
    }      
}

In [None]:
// InletVelocityX = GetFormulaObject(StaticFormulas.VelX_Inlet);
//var InletVelocityY = GetFormulaObject(StaticFormulas.VelY_Inlet);

In [None]:
var InletVelocityX = new Formula("X => 1 - X[1]*X[1]", false); // parabolic profile in y = X[1], zero at the walls y = -1 and y = +1
var InletVelocityY = new Formula("X => 0.0", false);

Finally, we set boundary values for our simulation. The initial values
are set to zero by default; for the steady-state simulation, initial
values are irrelevant anyway:

In [None]:
c.BoundaryValues.Clear(); 
c.AddBoundaryValue("wall", "VelocityX", WallVelocity); 
c.AddBoundaryValue("Velocity_Inlet", "VelocityX", InletVelocityX);  
c.AddBoundaryValue("Velocity_Inlet", "VelocityY", InletVelocityY); 
c.AddBoundaryValue("Pressure_Outlet");

## Activation and Monitoring of the Job

Finally, we are ready to deploy the job at the batch processor.
In a usual workflow scenario, we **do not** want to (re-)submit the 
job every time we run the worksheet; usually, one wants to run a job once.

The concept to overcome this problem is job activation. If a job is 
activated, the meta job manager first checks the databases and the batch 
system to see whether a job with the respective name and project name has already 
been submitted. Only if there is no information that the job was ever submitted
or started anywhere is the job submitted to the respective batch system.
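
The activation logic can be pictured with the following toy model. It is a simplified sketch, not the *BoSSS* implementation: the set `knownJobs` merely stands in for the records kept by the databases and the batch system, and the helper `Activate` as well as the job name used here are hypothetical:

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the records in the databases and the batch system:
// all (project, job name) pairs that were submitted at some point.
var knownJobs = new HashSet<(string Project, string JobName)>();
int submissions = 0;

// Hypothetical helper: submits only if there is no record of the job yet.
bool Activate(string project, string jobName) {
    if (!knownJobs.Add((project, jobName)))
        return false; // already known: nothing is submitted again
    submissions++;     // stands for the actual deployment at the batch system
    return true;
}

bool first  = Activate("MetaJobManager_Tutorial", "SteadyStateChannel");
bool second = Activate("MetaJobManager_Tutorial", "SteadyStateChannel"); // worksheet executed again

Console.WriteLine($"first: {first}, second: {second}, submissions: {submissions}");
// prints "first: True, second: False, submissions: 1"
```

Because the check is keyed on the project name and the job name, re-running the worksheet leaves an existing job untouched.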

First, a `Job` object is created from the control object:

In [None]:
var JobLocal = c.CreateJob();

This job is not activated yet, it can still be configured:

In [None]:
JobLocal.Status

In [None]:
// Test:
NUnit.Framework.Assert.IsTrue(JobLocal.Status == JobStatus.PreActivation);

### Starting the compute Job

One can change e.g. the number of MPI processes:

In [None]:
JobLocal.NumberOfMPIProcs = 1;

Note that these jobs are designed to be **persistent**:
This means the computation is only started 
**once for a given control object**, no matter how often the worksheet
is executed. 

Such a behaviour is useful for expensive simulations, which run on HPC
servers over days or even weeks. The user (you) can close the worksheet,
open and execute it again a few days later, and still access
the original job which was submitted a few days ago (maybe it is finished
by now).

Next, the job is activated, i.e. submitted (deployed) 
to a batch system.
If job persistency is not wanted, traces of the job can be removed 
on request during activation, causing a fresh job deployment at the
batch system:

In [None]:
JobLocal.Activate();  // execute the job in the default execution queue
//JobLocal.Activate(ExecutionQueues[4]);  // execute the job e.g. in queue 4

All jobs can be listed using the workflow management:

In [None]:
BoSSSshell.WorkflowMgm.AllJobs

Check the present job status:

In [None]:
JobLocal.Status

In [None]:
// Test:
NUnit.Framework.Assert.IsTrue(
   JobLocal.Status == JobStatus.PendingInExecutionQueue
   || JobLocal.Status == JobStatus.InProgress
   || JobLocal.Status == JobStatus.FinishedSuccessful);

### Evaluation of Job

Here, we block until all of our jobs (in this tutorial, only one) have finished:

In [None]:
BoSSSshell.WorkflowMgm.BlockUntilAllJobsTerminate(1000);

We examine the output and error stream of the job.
This directly accesses the `stdout` redirection of the respective job
manager, which may contain a bit more information than the 
`Stdout` copy in the session directory.

In [None]:
JobLocal.Stdout

Additionally, we display the error stream and hope that it is empty:

In [None]:
JobLocal.Stderr

We can also obtain the session 
which was stored during the execution of the job:

In [None]:
var Sloc = JobLocal.LatestSession;
Sloc

We can also list all attempts to run the job at the assigned batch processor:

In [None]:
JobLocal.AllDeployments

In [None]:
NUnit.Framework.Assert.IsTrue(JobLocal.AllDeployments.Count == 1, "MetaJobManager tutorial: Found more than one deployment.");

Finally, we check the status of our job:

In [None]:
JobLocal.Status

If anything failed, hints on the reason are provided by the `GetStatus` method:

In [None]:
JobLocal.GetStatus(WriteHints:true)

In [None]:
NUnit.Framework.Assert.IsTrue(JobLocal.Status == JobStatus.FinishedSuccessful, "MetaJobManager tutorial: Job was not successful.");

## Exporting Plots

Each run of the solver corresponds to one **session** in the database. A session is basically a collection of information on the entire solver run, i.e. the simulation result, input and solver settings, as well as metadata such as the computer, date and time.

Since only one solver run was executed in this tutorial, there is only one session in the Workflow Management (`wmg` is just an alias for `BoSSSshell.WorkflowMgm`):

In [None]:
wmg.Sessions

We select the first (and only) session and create an **export instruction** object.
The **supersampling** setting increases the output resolution. 
This is required to visualize high-order DG polynomials with the low-order Tecplot format.
Tecplot can only visualize a linear interpolation within a cell. With second-degree supersampling,
each cell is subdivided twice (in 2D, one subdivision yields 4 cells, i.e. 2 subdivisions yield $4^2 = 16$ cells).
In this way, the curve of, e.g., a second-order polynomial can be represented by the linear interpolation over 16 cells.
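
The cell counts quoted above follow from simple arithmetic, which the following sketch reproduces. The helper `SubCellsPerCell` is hypothetical and not part of *BoSSS*; the generalization from 4 sub-cells per subdivision in 2D to $2^{dim}$ in other dimensions is an assumption of this sketch:

```csharp
using System;

// Number of sub-cells per original cell after `level` recursive subdivisions,
// assuming one subdivision splits a cell into 2^dim sub-cells (4 in 2D).
long SubCellsPerCell(int dim, int level) {
    long perSubdivision = 1L << dim; // 2^dim
    long result = 1;
    for (int i = 0; i < level; i++)
        result *= perSubdivision;
    return result;
}

for (int level = 0; level <= 3; level++)
    Console.WriteLine($"2D, supersampling {level}: {SubCellsPerCell(2, level)} sub-cells per cell");
```

Since the count grows by a factor of 4 per level in 2D, the size of the plot files grows accordingly; a moderate supersampling level is usually a reasonable starting point.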

In [None]:
var outPath = wmg.Sessions[0].Export().WithSupersampling(2).Do();

In the respective directory (see output above), one should finally find plot files which can then be used
for further post-processing in third-party software such as Paraview, LLNL VisIt or Tecplot.

The `Do()` command returns the location of the output files:

In [None]:
outPath

To finalize this tutorial, we list all files in the plot output directory:

In [None]:
System.Threading.Thread.Sleep(10000); // just wait for the external plot application to finish
System.IO.Directory.GetFiles(outPath, "*").Select(fullPath => System.IO.Path.GetFileName(fullPath))