
#### Table of Contents

  1. [What is MPI?](#intro)
  2. Proposed Changes
  3. Printer Object
  4. WorldComm Object
  5. What an MPI developer needs to know
  6. What a non-MPI developer needs to know
## What is MPI? [Return to top](#Top)

Before we get to the title question, let us define some terminology (for the computer nerds out there, yes, I realize what I am saying is a simplification). A typical supercomputer can be thought of as a series of computers that have been networked together. Each computer is typically termed a node, and each processor within a node is termed a core. Each node has its own memory and possibly its own hard drive (disk). The memory within a node is known as "shared memory" because each core within the node shares it. On the scale of the supercomputer, however, all of the memory together is known as "distributed memory" because it is distributed among the nodes.

It is very important to understand the previous two sentences before moving on. Under a distributed memory model, each core can only directly access a small part of the total memory available on the supercomputer, specifically the memory that is on its own node. All other parts of the supercomputer's memory can still be accessed by a particular core, but it must be done indirectly. In general, the overwhelming majority of the memory available to a particular program is only available to each core indirectly.

Unfortunately, indirect (more commonly termed "remote") access of memory is costly in terms of time because the data needs to be sent via "slow" cables among the nodes. On the other hand, the shared memory available to a core can be accessed very quickly. What this amounts to is that programs need to take into account where stored data lives and try to minimize accessing memory outside of the node. Consequently, it is essential that the programmer can micromanage these operations and that the resulting programs store data in logical layouts.

MPI stands for Message Passing Interface and is the most common way of micromanaging the previously described memory situation. Under MPI, tasks (a generalization of nodes and cores that, for the time being, can just be thought of as equivalent to nodes) pass messages among themselves. These messages are the remote data. Say I have a 10-dimensional vector and each element "x" of that vector is stored on node "x". If node 2 needs element 3, then node 3 sends element 3 to node 2 as a message. Passing is required because node 2's memory is distinct from that of node 3 (we are operating under a distributed memory situation). In practice one will not store one element of a vector per node; rather, blocks of the vector will be on each node, and MPI then moves around blocks of the vector, not elements. Because in C++ essentially every pointer can be viewed as a vector, hopefully the generalization to other types of data is clear.

MPI provides numerous routines to pass these messages among nodes. At the most primitive level are what are called "sends" and "receives". These are messages between two MPI tasks: one task sends the message, the other receives it. Sends and receives can then be combined into more complex algorithms, suitable for more complicated data movement. Without going into too much more detail, there are numerous other common scenarios that occur, such as all tasks wanting to know element 3; this is called a broadcast, and MPI also provides routines for it. You can also imagine that each task has computed part of some very large summation and we want to sum these intermediate results to get one final answer; this is known as a reduce (or all reduce), and again MPI provides a method to do it. In general, nearly any movement of data that is needed has a corresponding MPI routine, and those routines are to be preferred as they are highly optimized.
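
To make the send/receive and reduce ideas concrete, here is a minimal, self-contained sketch in raw MPI (not Psi4's WorldComm wrapper, which is described below). The values and the requirement of at least four tasks are arbitrary choices for illustration only:

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);

    int me, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);     // which task am I?
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks); // how many tasks total?

    // Point-to-point: task 3 sends one element of "the vector" to task 2.
    // (Requires at least 4 tasks; the value 42.0 is just a placeholder.)
    double element = 0.0;
    if (me == 3) {
        element = 42.0;
        MPI_Send(&element, 1, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD);
    } else if (me == 2) {
        MPI_Recv(&element, 1, MPI_DOUBLE, 3, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // Reduce: sum one partial result per task into a single total on task 0.
    double partial = static_cast<double>(me), total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (me == 0) std::printf("Sum over %d tasks: %f\n", ntasks, total);

    MPI_Finalize();
    return 0;
}
```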

In order to understand what needs to be done to Psi4 for it to run under MPI, it is important to have an elementary understanding of how MPI actually works. Because Psi4 currently has OpenMP parallelization, we start with how that works. Under OpenMP parallelization a single executable is started at runtime; it then proceeds through the code until it hits an OpenMP directive. At this point the code becomes parallel and remains parallel until the directive ends. When the directive ends we have a serial environment again. Thus parallel "gotchas" need only be considered in the parallel regions. An MPI run of, say, "n" MPI tasks works by starting "n" separate executions of the program. Each one of these executions then runs through the source code until it hits an MPI directive telling that task to either do something or not do something. After the MPI directive, the code is still parallel. Thus the important distinction is that the parallel "gotchas" are not confined to parallel regions, but instead need to be accounted for throughout the entire code.

Finally, it is worth briefly touching on MPI's relationship to OpenMP. OpenMP only works for shared memory, whereas MPI technically works for both shared and distributed memory. However, because OpenMP's syntax is so much simpler than MPI's, it is not uncommon to use OpenMP to manage shared memory parallelization and MPI to manage the distributed parallelization; this is known as MPI/OpenMP hybrid parallelization. Again, there is no reason why MPI can't handle the shared memory as well as the distributed memory other than ease of syntax. On the other hand, OpenMP cannot manage distributed memory at all, and MPI is required. Routines that are currently written with only OpenMP parallelization can often benefit from MPI parallelization simply by batching the routine and letting each MPI task run a batch.
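
As a rough illustration of that batching idea, here is a hedged sketch of hybrid MPI/OpenMP parallelization written against raw MPI and OpenMP (the function name, the even divisibility of the work, and the per-item operation are assumptions made for brevity; in Psi4 the MPI side would go through WorldComm instead):

```cpp
#include <mpi.h>

// Distribute "nitems" independent work items: MPI splits them across tasks,
// and OpenMP splits each task's batch across the cores of its node.
void process_batches(double* work, int nitems) {
    int me, ntasks;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    int batch = nitems / ntasks;   // assume nitems % ntasks == 0 for simplicity
    int start = me * batch;
    int end   = start + batch;

    // Shared-memory parallelism within this task's batch.
    #pragma omp parallel for
    for (int i = start; i < end; ++i) {
        work[i] *= 2.0;            // placeholder for the real per-item work
    }
}
```

With one MPI task per node and OpenMP threads filling that node's cores, this is exactly the hybrid layout described above.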

## Proposed changes [Return to top](#Top)

A series of big, behind-the-scenes changes are needed in order to make Psi4 MPI compatible. First and foremost, we need an entire overhaul of how files are handled. Currently there are no restrictions on which MPI task reads or writes files (including the output file). This means each MPI task reads and writes each file. If we have two MPI tasks running on a node (I know I told you task==node, but in reality a task may actually be half a node, or really any combination of cores), this raises problems in that the two tasks are now competing to read and write the same data file. If, while task 1 is writing the file, task 2 starts trying to write the file, the file will be garbled. Conversely, if task 1 is trying to read the file while task 2 is writing it, task 1 will get incomplete information. Of course there is also the possibility that the two tasks are sufficiently out of sync that task 1 writes the file, then task 2 writes the file, and the only harm done is that the file is written twice; however, such behavior should never be assumed.

Some attempts to restrict file access in this way already existed in Psi4. Numerous libraries (libmints and libpsio in particular did a good job) were careful to make sure only task 0 edited files (in true C style, MPI numbers tasks starting with 0, and as a result task 0 is often thought of as the master task, although there really is nothing intrinsically special about it other than that we are always guaranteed to have a task 0). In principle, this behavior should be contained within an object, so that each developer who reads and writes files only has to use this object for file manipulations and the object guarantees everything is done "on the up and up". Psi4 also provides the (slightly hidden) option to print output directly to the screen. Although not a file per se, such printing still suffers from the possibility of each message being printed multiple times. Because within C++ stdout and files are polymorphically related, both of these problems can be handled by the same solution: a new printer object.

The second large change is implementing a class that will handle all of the MPI calls. With OpenMP parallelism each developer was (and still is) free to include their own OpenMP calls for their code. MPI, however, is much more complicated and it is beneficial to have an object (called WorldComm for historical reasons) that will take care of much of the complexity for the developer. The details of WorldComm are only relevant to developers who intend to use MPI, so we don't discuss it here.

## Printer Object [Return to top](#Top)

The most important change for developers to grasp, pertaining to keeping Psi4 MPI compatible, is the new printer object (I'm sloppy with my C++ lingo and use object and class interchangeably; technically an object refers to a specific instance of a class). Printer objects are found in most large source codes and, as the name implies, they control all of the printing (be it to a file or the screen). This can be things like which MPI task gets to print, how many columns are allowed per line (I really badly want to enforce an 80 column limit, but have refrained...), functions for printing banners and tables, "grep-able" lines, or just in general standardizing common printing tasks.

Currently Psi4's printer object is only worried about ASCII files, as they are what is preventing Psi4 from running in parallel at the moment, but in theory the printer object can be expanded to binary files quite easily as well, simply by changing the flags used to open the internal ostream or istream objects. Back to the printer object. I call it an object, but it's actually a hierarchy of classes that mirror the C++ ios classes. The base class of the hierarchy is what I have called PsiStreamBase and it lives in the file libparallel/StreamBase.h. Operationally PsiStreamBase looks like:

```cpp
class PsiStreamBase{
   protected:
     bool ImSpecial();
     stringstream Buffer_;
};
```

ImSpecial() defines the criteria under which a PsiStream actually gets to read from or write to its Buffer_. If ImSpecial() returns true, the current MPI task is allowed to read/write; otherwise, it's more or less along for the ride. As you may have inferred from the previous statements, Buffer_ is where each MPI task stores the data.

PsiStreamBase has two child classes, PsiOutStream and PsiInStream. Operationally, PsiOutStream looks like:

```cpp
class PsiOutStream : public PsiStreamBase{
     private:
          void DumpBuffer();
     protected:
          boost::shared_ptr<ostream> Stream_;
     public:
          void Printf(const char* format,...);

          template<typename T>
          ostream& operator<<(const T& message);
};
```

The PsiInStream class is very similar, except that instead of being geared toward writing to a stream, it is geared toward reading from a stream; one can just take the following discussion and substitute "write/print" with "read" and ostream with istream.

Walking through the class members: Stream_ is a pointer to the stream we are writing our data to. Because ofstreams (files) are derived from ostreams (stdout and stderr), this is the class that actually implements the mechanism for printing to stdout and to an output file. Two interfaces for writing are provided. The first is the printf-like syntax that C defines, the other is the stream interface that C++ defines. Both interfaces ultimately dump their data into Buffer_, so it is irrelevant which one is used, and developers are free to use the one that is more convenient for them. As of right now, every call to either Printf() or operator<<() will in turn call DumpBuffer(). DumpBuffer() is the common "leg-work" that needs to be done to flush Buffer_ into Stream_, implicit in which is a call to ImSpecial() to determine who actually gets to print.

Derived from PsiOutStream and PsiInStream, respectively, are OutFile and InFile. For the most part these are the classes you will likely be interacting with. Again, the reading and writing mechanisms are implemented at the PsiOutStream or PsiInStream level, so the only thing this level really adds is the mechanism to open and close Stream_, a complexity that did not exist for stdout and stdin. Again, operationally OutFile looks like:

```cpp
class OutFile: public PsiOutStream{
 public:
   OutFile(const std::string& name="",const FileMode& mode=APPEND);
   void Open(const std::string& name ,const FileMode& mode );
   void Close();
};
```

If a filename is given to the OutFile constructor, a file with that name will be opened by calling Open(). If you don't know the file name at construction, or for whatever reason you don't want to open it right away, you may instead call the Open() function after construction. If the file cannot be opened for whatever reason, a PSIEXCEPTION will be thrown, so you are guaranteed that the file is truly open when you go to write to it. The second argument to both the constructor and Open() is an enumerated type called FileMode. Its values are things like APPEND or TRUNCATE, which respectively mean "start writing at the end of the file, leaving its contents alone" and "overwrite the file". The last thing to note is that the file name that is passed in is relative to Psi4's current directory.
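
For concreteness, here is a hedged usage sketch (the file names and printed values are made up, and the exact scoping of the FileMode enumerators is glossed over):

```cpp
// Open at construction; throws a PSIEXCEPTION if "scratch.dat" cannot be opened.
OutFile log("scratch.dat", TRUNCATE);
log.Printf("Step %d complete\n", 1);
log.Close();

// Name not known until later: default-construct, then Open() (default mode is APPEND).
OutFile results;
results.Open("results.dat", APPEND);
results.Printf("Hello from OutFile\n");
results.Close();
```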

As for using these objects, the main use will be printing to the output file (or stdout), as determined by the user. In include/psi4-dec.h there is a line like:

```cpp
namespace psi{
...
extern boost::shared_ptr<PsiOutStream> outfile;
...
}
```

As long as psi4-dec.h is included, you can print to wherever the output is being directed (either a file or stdout) by either of the following options:

#include "psi4-dec.h" //Gives us psi::outfile
#include <iomanip> //Needed for setwidth(), setprecision(), etc.

/* Outside the psi namespace, (not that you'd ever do that) */
psi::outfile->Printf("Hello %d Worlds!!!\n",2);
(*psi::outfile)<<"Hello "<<2<<" Worlds!!!"<<endl;

/*Both print "Hello 2 Worlds!!!" */

/* Inside psi namespace, no need for scoping */
outfile->Printf("Pi is %16.12f\n",3.14159265358979323846264338327950288419716939);
(*outfile)<<"Pi is "<<setwidth(16)<<setiosflags(ios::fixed)<<setprecision(12)<<3.14159265358979323846264338327950288419716939<<endl;

/* Both print "    3.141592653590"*/

Incidentally, let me take this time to rant about a Psi4 coding practice I see everywhere. Note that I put #include "psi4-dec.h" and not #include <psi4-dec.h>. There is a big difference. Angle brackets are used for include files that come from outside the project (anything a user would have installed system-wide, so Boost, BLAS, the C standard library, etc.), whereas quotes are for includes that are part of the project being built. When a compiler comes across angle brackets it looks in the standard search paths; quotes tell it to look in the project's own directories first and fall back to the standard paths only afterward. Using quotes and angle brackets properly can reduce compile time because the compiler finds includes faster. (I will also say that there are ways to override this behavior; however, in the time I have spent looking at Psi4's build files I haven't seen such modifications, so I suspect Psi4 is using the standard build behavior. If it is not, I am fully willing to eat this rant.) End rant.
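
A quick illustration of the convention (the header names other than psi4-dec.h are just examples):

```cpp
#include <vector>                // standard library: angle brackets
#include <boost/shared_ptr.hpp>  // external, system-installed dependency: angle brackets
#include "psi4-dec.h"            // part of the Psi4 source tree: quotes
```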

Anyways, back to the code example. Note that all of the code above is completely MPI safe, even though nowhere did we do something like if(WorldComm->me()==0)outfile->Printf(...); (WorldComm will be explained in the WorldComm Object section; here it is just checking which MPI task we currently are). Also note that we are not flushing the stream anywhere. Both stream flushing and MPI safety are taken care of for us by the OutFile object (if printing slows Psi4 down terribly, the flush rate may need to be tweaked, as it is currently flushing quite often). As a casual developer, this is all you need to do to print to the output file.

Lastly, it is worth mentioning that a large fraction of file pointers have been removed from the source code, and I am going to encourage people not to bring them back. In general a file should be read into memory and then closed, in order to avoid having too many files open at one time, a problem that is likely to become an issue as more and more tasks open and close files. Previously, many functions allowed people to pass in a file pointer, which was in turn where that function printed. In principle this could lead to a lot of open files. My observation is that over 95% of such calls were just passing the output file; however, with the output file being a global variable, there is no need to pass it around.

It's also worth noting that, because of how the new printer object works, you cannot pass it a file pointer to specify which file to print to (you tell it the name of the file it should open). Consequently, I swapped many of the file pointers out for std::string objects instead [for easy identification of which functions I did this to, I renamed the argument to std::string OutFileRMR, which I suspect will have no conflict with any name used anywhere else in Psi4 and can be easily searched for (there are a lot of instances though)]. At the base of any routine that originally accepted a file pointer, I now have a ternary operator that checks whether that string is equal to "outfile"; if it is, the output is printed to the current output file, otherwise a file with that name is opened. Right now the only file that should remain open is the output file.
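
A hedged sketch of that pattern (the routine name and the exact smart-pointer plumbing are assumptions for illustration, not the literal Psi4 code):

```cpp
void SomeRoutine(const std::string& OutFileRMR = "outfile") {
   // "outfile" means the global output stream; any other string is opened as its own file.
   boost::shared_ptr<psi::PsiOutStream> printer =
       (OutFileRMR == "outfile"
            ? psi::outfile
            : boost::shared_ptr<psi::PsiOutStream>(new psi::OutFile(OutFileRMR)));
   printer->Printf("Results for this routine go here\n");
}
```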

## WorldComm Object [Return to top](#Top)

Psi4 previously included a WorldComm object; however, that object was not in widespread use and it lacked quite a bit of basic MPI support, notably support for sub-communicators (to be explained below). As with the printer object, WorldComm is actually a hierarchy of objects, but this time it's a two-level hierarchy. The base level is a class called Parallel and looks something like this:

```cpp
template<typename DerivedType>
class Parallel{
   public:
       void sync(const string& CommName="NONE") const;

       template<typename T>
       void all_gather(const T* data, const int nelem, T* target, const string& CommName="NONE") const;

       template<typename T>
       void bcast(T* data, const int nelem, const int bcaster, const string& CommName="NONE") const;

       /* Returns the identity of the current MPI task */
       int me(const string& CommName="NONE") const;

       /* Returns the number of MPI tasks in the current communicator */
       int nproc(const string& CommName="NONE") const;

       /* Returns the current communicator */
       string communicator() const;

       void MakeComm(const string& NewName, const int Color, const string& CommName="NONE");

       void FreeComm(const string& CommName="NONE");
};
```

There's a lot going on here, so let's take it in steps. The first thing we see is that we have some basic MPI routines (albeit with the names changed; this is what I inherited, otherwise the names would be the same as the MPI ones...), all of which take some argument CommName that defaults to "NONE". MPI defines a group of tasks to be a communicator, the most fundamental of which is called MPI_COMM_WORLD and represents all of the tasks that a program was given at execution. For various reasons, which we get into below, one may wish to split MPI_COMM_WORLD into what are called sub-communicators. A sub-communicator is a subset of the MPI_COMM_WORLD communicator, meaning it may have anywhere from 1 to the total number of MPI tasks; creating sub-communicators is discussed below, when we describe why you may want to. For the time being, all we need to know is that a (sub-)communicator is the collection of tasks we have available to perform our parallel work for us; hence every MPI call needs to know which tasks are performing the parallel work.

Developers familiar with the previous WorldComm object may know that the communicator was not a variable and instead was hard-coded to MPI_COMM_WORLD. That is a very bad idea, and developers should never assume they are working with MPI_COMM_WORLD from now on. So how does one figure out which communicator they actually are working with? Call communicator(), which will return the name of the current communicator. However, if the developer just wants to operate using the current communicator, then all you have to do is not provide a communicator (or, equivalently, pass "NONE"; the idea is that the default argument of "NONE" tells WorldComm to use the current communicator). Again, we follow up with communicators in much more detail later and defer further conversation until then.

As for the non-obvious routines that are available, we have:

* sync(const string& CommName), which stops program execution at this call until all tasks that are part of CommName reach this line of code. It is essential that all tasks belonging to CommName can actually reach this part of the code. If one or more tasks can never reach this sync call, then the program will sit here indefinitely waiting for those tasks. Be very careful when using this call.

* all_gather(), which takes a vector that is distributed (blocks of it are on each MPI task) and combines it into one vector, of which each task gets a copy. Because it calls Boost's all_gather implementation, each block must be the same size. This will not be possible if the amount of data is not divisible by the number of tasks; in such a case, the remainder will need to be broadcast separately. Perhaps in the future we can address this problem within the all_gather call, but for the moment it is something to be aware of.

* bcast(), which broadcasts data from one MPI task to all other MPI tasks. Note that all MPI tasks on the communicator need to call it (not just the broadcaster, and not just receivers: everyone).

Currently these are all of the MPI routines that have been needed, but if, for example, someone wants an all_reduce, this would be the place to add it. This brings us to the next topic: how WorldComm actually works. As of right now Psi4 users are not obligated to have MPI, so the code needs to function both with and without MPI. For this reason Parallel is actually an "abstract base class" (quotes to be explained shortly). From Parallel two other classes are derived: LocalCommWrapper and MPICommunicator (again, the naming choices were inherited and are not mine; I hate them enough that I may change them though...). As you may guess, LocalCommWrapper implements all of the MPI functions for the case where there is only one MPI task. This means that it really does a whole lot of nothing (there's not actually any data to move...).

MPICommunicator, on the other hand, contains quite a bit of actual code. At this point note that the MPI calls are templates, i.e. I can pass an array of doubles to bcast() just as easily as I can pass an array of strings. For people familiar with the underlying MPI calls, you know that one of the arguments to each MPI call is the type (double, char, float, etc.) of the data that is being messaged. The Boost libraries provide an MPI extension that automatically takes care of passing the data to the underlying MPI calls for us. In fact, Boost knows how to deal with passing any vectorizable (the data is stored sequentially) object to MPI. If additional calls are added, it's important they are added in terms of the Boost MPI calls to preserve this nice templating feature. In particular, note that passing all data as "raw" data to MPI is not an acceptable solution because it is not portable (different endianness is handled by MPI if it knows the type, but it can't adjust raw data for you).

Warning: Expert Content!! Now we come back to the air quotes around "abstract base class". I've told you that Parallel's MPI functions are both virtual (overridden by LocalCommWrapper and MPICommunicator) and templated. I lied; they are only templated. In fact, C++ prohibits a templated virtual function (a compiler can't actually generate usable code, because virtual means the function needs to be figured out at runtime, while templated means it needs to be figured out at compile time...). The illusion of virtual-ness is maintained by having each derived class implement a function called XImpl() that has the same signature as X() in the base class, where X is an MPI function like bcast(), sync(), etc. You may have noticed that Parallel is a template class, and we've totally ignored this fact up until this point. The DerivedType parameter is actually the type of the derived class and can be used to cast the base down to the derived class, in turn calling the correct implementation. This "trick" is known as the "curiously recurring template pattern" for those wanting to know more. End Expert Content!!!

If you are just trying to code up other MPI routines, looking at bcast(), for example, should show you what you need to do. Just switch bcastImpl() for whatever routine you are coding up and define an XImpl() for both LocalCommWrapper and MPICommunicator. Lastly, note that if you were to define bcast() in Parallel and bcastImpl() for MPICommunicator, but not for LocalCommWrapper, the compiler would complain that LocalCommWrapper does not have a member bcastImpl() and fail to compile. This is how we enforce a common interface for both our local and MPI implementations. A corollary to this is that all functions that are meant to be called should be defined in the Parallel object and not in the derived classes. Only the implementation should be defined in the derived classes, and it should be private!!!!!!
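
For the curious, here is a minimal, hedged sketch of that dispatch mechanism (the class bodies are simplified and are not the literal Psi4 source):

```cpp
#include <string>

template<typename DerivedType>
class Parallel{
   public:
      template<typename T>
      void bcast(T* data, const int nelem, const int bcaster,
                 const std::string& CommName="NONE") const {
         // Compile-time "virtual" dispatch: cast down to the derived type
         // and call its private bcastImpl().
         static_cast<const DerivedType*>(this)->bcastImpl(data, nelem, bcaster, CommName);
      }
};

class LocalCommWrapper : public Parallel<LocalCommWrapper>{
   private:
      // The base class needs access to the private implementation.
      friend class Parallel<LocalCommWrapper>;

      template<typename T>
      void bcastImpl(T*, const int, const int, const std::string&) const {
         // Only one task exists, so the data is already everywhere it needs to be.
      }
};
```

Calling bcast() on a LocalCommWrapper compiles down to the do-nothing local implementation, while the analogous bcastImpl() in MPICommunicator would forward to the Boost MPI broadcast.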

Finally, the last thing we need to talk about is creating communicators and why we would want to do this. Imagine that our SCF routine was MPI parallelized (it will be soon...) and imagine we don't have gradients for it. One can imagine wanting to do a finite difference calculation to obtain the gradient of the energy. Finite difference calculations fall under the realm of "embarrassingly parallel" calculations, basically any calculation that consists of a bunch of small, independent calculations. One can imagine wanting to write an MPI routine that parallelizes these calculations. Under a typical supercomputer allocation you probably have fewer MPI tasks than you do finite difference calculations, so a skeletal finite difference routine looks like:

```cpp
/* Number of MPI Tasks*/
int ntasks=WorldComm->nproc();

/* Central difference has three calculations per atomic degree of freedom */
int ncalcs=NAtoms*9;

/* The number of calcs each MPI task is responsible for (assuming ncalcs modulo ntasks =0 for simplicity)*/
int batchsize=ncalcs/ntasks;

/* Get my task number*/
int me=WorldComm->me();

/* Get my loop limits */
int batchstart=me*batchsize;
int batchend=(me+1)*batchsize;

for(int calc=batchstart;calc<batchend;++calc){

   /* Do set-up for this calc*/

   /* Run SCF, get energy*/

   /* Do clean-up/ update component of gradient */

}

/* Gather the gradient up from each task*/
WorldComm->all_gather();
```

A priori you may think this is all well and good, but there's a problem. The SCF routine is parallelized with MPI too. When it goes and sets up its parallelization, it's going to think it also has ntasks to work with. Even worse, if the SCF routine involves a sync call, then the program will hang indefinitely, because only one task can get to each barrier. Technically this problem is present for OpenMP as well, but OpenMP allows a simple workaround by letting one turn off nested parallelism (OpenMP directives calling for additional parallelism within other OpenMP parallel directives are ignored).

MPI, thankfully, built support for this into the standard via what are known as subcommunicators. Returning to the above example, each MPI task available to our finite difference routine is handling its own SCF call, so we need to split the current communicator (obtained via WorldComm->communicator()) into ntasks subcommunicators, such that there is one MPI task per subcommunicator. The above code becomes:

```cpp
/* Number of MPI Tasks*/
int ntasks=WorldComm->nproc();

/* Central difference has three calculations per atomic degree of freedom */
int ncalcs=NAtoms*9;

/* The number of calcs each MPI task is responsible for (assuming ncalcs modulo ntasks =0 for simplicity)*/
int batchsize=ncalcs/ntasks;

/* Get my task number*/
int me=WorldComm->me();

/* Get my loop limits */
int batchstart=me*batchsize;
int batchend=(me+1)*batchsize;

/* What you call the comm is arbitrary*/
string NewCommName="OurNewComm";

/* For reference*/
string OldComm=WorldComm->communicator();

/* Third argument defaults to current comm*/
WorldComm->MakeComm(NewCommName,me);

for(int calc=batchstart;calc<batchend;++calc){

   /* Do set-up for this calc*/

   /* Run SCF, get energy*/

   /* Do clean-up/ update component of gradient */

}

/* Note that the all_gather call is over OldComm, not the new comm*/
WorldComm->all_gather(...,...,...,OldComm);

/* After this call OldComm is the current Comm*/
WorldComm->FreeComm(NewCommName);
```

There are a couple of things to note. Most importantly, you need to free the communicator up!!!!! If you don't free the communicator, the remainder of the code will execute under reduced parallelization. If you call MakeComm, you need to call FreeComm. The second thing to note is that our MPI code is a bit more complicated now because we have to worry about which communicator things run over; in particular, they are not always over the current communicator anymore (for example, the all_gather call). When you make a subcommunicator, for all intents and purposes that is the communicator you are passing to all subfunctions; therefore, at the current level you want to operate on the subcommunicator's parent communicator (what we called OldComm). Note that in the example above, if we had put the FreeComm call before the all_gather call, we would actually have been okay using the current comm, but in general there will be MPI calls at the current level between the Make and Free calls; this is what I tried to demonstrate.

The last thing worth mentioning about this example is the syntax for making a subcommunicator. The second argument is what the MPI standard calls the "color" of the current MPI task. When splitting a communicator, each task that is part of that communicator must cross that call (i.e., if tasks 0 and 1 are becoming a subcommunicator, tasks 2 and up must also cross the Make call) and must provide a color. Tasks are assigned to subcommunicators based on their color: tasks with the same color end up in the same subcommunicator, and the number of distinct colors dictates the number of subcommunicators that get formed. In the example above, I passed the task ID as the color. Since the task ID is unique to each task, each task was assigned to its own subcommunicator. When each SCF call goes to do its parallelization, it will be doing so on a communicator consisting of 1 task.
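
As another hedged illustration of how the color works (the communicator name is arbitrary, and an even number of tasks is assumed), splitting the current communicator into two-task subcommunicators would look something like:

```cpp
/* Tasks 0 and 1 share color 0, tasks 2 and 3 share color 1, and so on,
   so each resulting subcommunicator contains exactly two tasks. */
int me=WorldComm->me();
int color=me/2;
WorldComm->MakeComm("PairComm",color);

/* ... MPI-parallel work that should only see two tasks goes here ... */

WorldComm->FreeComm("PairComm");
```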

Hopefully from this subcommunicator discussion you can see that you need to split the communicator any time you call a function that may contain additional MPI parallelization. Note that just because the function you are calling doesn't currently have MPI parallelization doesn't mean it won't in the future.

## What an MPI developer needs to know [Return to top](#Top)

An MPI developer needs to be familiar with the variety of topics addressed on this page, as well as general MPI programming paradigms. Moving forward, I suspect that the biggest issue that will come up is the spawning of new communicators, as I suspect many developers are not used to doing this. Because Psi4 is such a big program, it is essential that we use communicators to manage how many tasks are available to a routine, rather than always assuming we have the maximum number, as most programs can.

## What a non-MPI developer needs to know [Return to top](#Top)

For the most part these changes are intended to be completely transparent to a non-MPI developer, other than that the syntax for printing has changed slightly. If you use the new OutFile and InFile objects, MPI is other people's problem, not yours. The other thing to note is that if you are not using MPI, there is absolutely zero reason whatsoever to call WorldComm. If there is no MPI dependence in your code, it makes no difference whether the tasks go through it together and synced or asynchronously. One of the biggest problems I saw previously was people setting up MPI barriers when they didn't need them. Setting up a barrier incorrectly can cause the code to hang indefinitely, because some tasks will never reach the barrier if your code is called from above in parallel. Although the WorldComm object is set up in such a way that spurious barrier calls should now land on the correct communicator, and hanging should be mitigated, the fail-safes in WorldComm can't work if a proper communicator was not constructed.