Details on partial halo exchanges
gihanmudalige committed Dec 12, 2013
1 parent 9621341 commit 8eb44ab
Showing 1 changed file (doc/mpi-dev.tex) with 36 additions and 29 deletions.

\subsection{Partial Halo Exchange}\label{sec/partialhalo}

The halo exchange for a given \texttt{op\_set} will trigger an exchange of all the halo elements for this set. The
reason is that OP2 creates the halo for an \texttt{op\_set} based on all the mapping tables from and to that set (as
detailed previously). Therefore, when a parallel loop is executed over a boundary set that has very sparse connectivity
to an internal set, the full halo of the internal set will be exchanged. In some 3D mesh applications the connectivity
from a boundary set to an internal set is significantly smaller, pointing to a case where a partial halo exchange would
be advantageous for performance. With a partial halo exchange, only the message latency needs to be hidden, rather than
the latency plus the time to actually transfer the full halo of data.

As a result, a partial halo exchange mechanism has now been implemented, where, based on the mapping table that
determines the connectivity between the sets, only the halo elements related to this map are exchanged. The same halo
structs are used as before, but now a per-map halo is created in \texttt{op\_halo\_permap\_create()}. This per-map halo
is exchanged if the total number of (global) mapping table entries (for this map) that reference foreign elements over
a partition boundary is less than 30\% of the total size of the halo for the exchanged set.

Thus, for example, assume that a map exists from boundary edges to internal nodes (\texttt{bedges\_to\_nodes}) and a
map exists from internal edges to internal nodes (\texttt{edges\_to\_nodes}). If the number of halo elements due to
\texttt{bedges\_to\_nodes} is less than 30\% of the halo elements due to both \texttt{bedges\_to\_nodes} and
\texttt{edges\_to\_nodes} combined, then in a loop over \texttt{bedges} that indirectly accesses internal nodes only a
partial halo is exchanged, using the per-map halo of \texttt{bedges\_to\_nodes}.
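The 30\% decision rule amounts to a simple size comparison. A minimal sketch in C, with illustrative parameter names for the per-map and full halo sizes (the actual OP2 internals differ):

```c
/* Sketch of the partial-halo decision rule: exchange the per-map halo
 * only if it is less than 30% of the full halo of the exchanged set.
 * Parameter names are illustrative, not the actual OP2 internals:
 *   permap_halo_size - mapping table entries (for this map) referencing
 *                      foreign elements over a partition boundary
 *   full_halo_size   - total halo size of the exchanged set
 * Returns 1 if the per-map halo should be exchanged, 0 otherwise. */
static int use_permap_halo(long permap_halo_size, long full_halo_size)
{
    /* permap < 0.3 * full, kept in integer arithmetic */
    return permap_halo_size * 10 < full_halo_size * 3;
}
```

A boundary-set map touching, say, 29 of 100 halo elements would qualify for the partial exchange; one touching 30 or more would not.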


% global operations
\subsection{Fetching Data}\label{subsec/putfetch}
will replace the internal \texttt{op\_dat}'s data values. A valid implementation will need to translate the original set
element index to the current set element index.
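Such a translation can be sketched as a gather through a renumbering array. The \texttt{orig\_to\_current} array and function below are hypothetical, purely for illustration of the index translation; they are not part of the OP2 API:

```c
/* Sketch of translating original set element indices to current
 * (post-partitioning/renumbering) indices when copying op_dat values.
 *   dat             - current data, n elements of `dim` components each
 *   orig_to_current - hypothetical renumbering: element i of the
 *                     original set now lives at orig_to_current[i]
 *   out             - receives the data back in original set order */
static void fetch_translated(const double *dat, const int *orig_to_current,
                             int n, int dim, double *out)
{
    for (int i = 0; i < n; i++)          /* original-order element index */
        for (int d = 0; d < dim; d++)    /* copy each component */
            out[i * dim + d] = dat[orig_to_current[i] * dim + d];
}
```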



\begin{figure}[t]\small
\vspace{-0pt}\noindent\line(1,0){470}\vspace{-10pt}
\begin{pyglist}[language=c]
\end{pyglist}
\normalsize\vspace{-0pt}\label{fig:perf}
\end{figure}

% performance measurements - including flags to trigger them
\subsection{Performance Measurements}\label{subsec/perf}

For measuring the execution time of code, two timer routines are implemented. Firstly, \texttt{op\_timers\_core()} (in
\texttt{op\_lib\_core.c}) measures the elapsed time on a single MPI process, while \texttt{op\_timers()}
(in \texttt{op\_mpi\_decl.c}) has an implicit \texttt{MPI\_Barrier()} so that time across the whole MPI universe can be
measured. The time spent in the \texttt{op\_par\_loop()} calls is measured and accumulated. The setup costs due to halo
creation and partitioning are also measured, and the maximum over all the processes is printed to standard out by
rank 0. Additionally, information about the amount of MPI communication performed is collected. For each
\texttt{op\_par\_loop()} we maintain a struct that holds (1) the accumulated time spent in the loop, (2) the number of
times the \texttt{op\_par\_loop()} routine is called, (3) the indices of the \texttt{op\_dat}s that require halo
exchanges during the loop, (4) the total number of times halo exchanges are done for each \texttt{op\_dat} and (5) the
total number of bytes exported for each \texttt{op\_dat}.
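The five per-loop quantities suggest a record along the following lines. The field names here are illustrative guesses only; the actual \texttt{op\_mpi\_kernel} struct is the one shown in \figurename{~\ref{fig:perf}}:

```c
/* Illustrative sketch of a per-loop performance record holding the five
 * quantities listed above. All field names are guesses, not the actual
 * op_mpi_kernel definition. */
typedef struct {
    const char *name;           /* loop name used to identify the record  */
    double      time;           /* (1) accumulated time spent in the loop */
    int         count;          /* (2) number of op_par_loop() calls      */
    int        *dat_indices;    /* (3) op_dats needing halo exchanges     */
    int        *num_exchanges;  /* (4) halo exchanges per op_dat          */
    long       *bytes_exported; /* (5) bytes exported per op_dat          */
} perf_record;
```

On each invocation of a loop, its record's \texttt{count} is incremented and the elapsed time is added to \texttt{time}.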

Currently, the only way to identify a loop is by its name. Thus we use a hash function to compute a key for the
\texttt{op\_mpi\_kernel} struct (\figurename{~\ref{fig:perf}}) of each loop and store the struct in a hash table.
Monitoring the halo exchanges requires calls to \texttt{op\_mpi\_perf\_comm()} (defined in \texttt{op\_mpi\_core.c})
\subsection{Hybrid CPU/GPU Execution}\label{sec/hybrid}

\section{To do list}
\begin{itemize}
\item Need documentation for Hybrid CPU/GPU
% \item Need documentation for Partial halo exchanged
\item Need documentation for Mesh renumbering
\item Implement automatic check-pointing over MPI
\end{itemize}

