OpenCAL-OMP.tex

\chapter{OpenCAL OpenMP version}\label{ch:opencal-omp}

OpenCAL-OMP is the parallel OpenMP-based implementation OpenCAL, able
to expoit all the processing elements on a shared memory
machine. OpenCAL-OMP main structure and statements convention remain
unchanged with respect to OpenCAL. Moreover, similarly to the serial
version, OpenCAL-OMP allows for some \emph{unsafe operations}, which
can significantly speed up the application. However, the utmost
attention must be paid to avoid\textsl{race condition} issues when
unsafe operations are considered\footnote{For instance, when many
  threads perform concurrent operations on the same memory locations
  and such operations are made by more than one atomic machine
  instruction, it can happen that they can interleave, giving rise to
  wrong (i.e., non consistent) results. Furthermore, even in the case
  of atomic operations, the logic order of execution could not be
  respected. Thus, for instance, a read-write logic sequence of atomic
  operations can actually become a write-read (wrong) sequence due to
  the fact that the thread performing the write operation is executed
  first.}. In the following Sections we will introduce OpenCAL-OMP by
examples, highlighting differences with respect to the serial
implementations presented in Chapter \ref{ch:opencal}. In particular,
Game of Life is firstly implemented. Subsequently, four different
implementations of SciddicaT are illustrated, to show how simulation
efficiency can be improved. The implementation of a simple 3D CA is
also presented. The last part of the Chapter deals with OpenCAL-GL and
shows how to integrate a basic OpenGL/GLUT visualization system in
both 2D and 3D applications based on OpenCAL-OMP.


\section{Conway's Game of Life in OpenCAL-OMP}

In Section \ref{sec:cal_life}, we shown a possible OpenCAL
implementation of Conway's Game of Life. Here, we present an
OpenCAL-OMP implementation of the same cellular automaton, by
discussing the differences with respect its serial implementation. The
complete source code can be found in Listing \ref{lst:calomp_life}.

\lstinputlisting[float,floatplacement=H, label=lst:calomp_life, caption=An OpenCAL-OMP implementation of the Conway's game of Life.]{../opencal-examples/OpenCAL-OMP/calomp_life/source/life.c}

As can be seen, the OpenCAL-OMP implementation of $Life$ is almost
identical to the serial one thanks to the seamless parallelization
adopted by the library. The only differences can be found at lines 3-5
where, instead of including the OpenCAL header files, the OpenCAL-OMP
headers can be found. All the remaining source code is unchanged. In
this specific case, besides considering the OpenCAL-OMP header files
instead of the OpenCAL ones, no differences do exist between serial
and parallel source codes.


\section{SciddicaT}

In this Section, the OpenCAL-OMP implementations of the four SciddicaT
versions presented in Chapter \ref{ch:opencal} are shown.

\subsection{SciddicaT naive implementation}

As for the case of Conway's Game of Life, even the OpenCAL-OMP naive
implementation of the SciddicaT cellular automaton, namely
$SciddicaT_{naive}$, shown in Listing \ref{lst:calomp_sciddicaT}, does
not significantly differ from the serial implementation (cf. Section
\ref{sec:sciddicaT}). As for the $Life$ example, differences only
regard the included headers (lines 3-5), while the remaining source
code is unchanged.

\lstinputlisting[label=lst:calomp_sciddicaT, caption=An OpenCAL-OMP implementation of the $SciddicaT_{naive}$ debris flows simulation model.]{../opencal-examples/OpenCAL-OMP/calomp_sciddicaT/source/sciddicaT.c}

\subsection{SciddicaT with active cells optimization}
The OpenCAL-OMP implementation of $SciddicaT_{ac}$, which takes
advantage of the built-in OpenCAL active cells optimization, can be
found in Listing \ref{lst:calomp_Sciddicat-activecells}. The
corresponding serial implementation can be found in Section
\ref{sec:sciddicaT_active}.

\lstinputlisting[label=lst:calomp_Sciddicat-activecells, caption=An OpenCAL-OMP implementation of the SciddicaT debris flows simulation model with the active cells optimization.]{../opencal-examples/OpenCAL-OMP/calomp_sciddicaT-activecells/source/sciddicaT.c}

Differently from the $SciddicaT_{naive}$ implementation shown in
Listing \ref{lst:calomp_sciddicaT}, unsafe operations are here
exploited due to the active cells optimization. Indeed, the
\verb'calAddActiveCellX2D()' function, called at line 87, adds a cell
belonging to the neighborhood to the set $A$ of active cells. As
evident, more threads can try to add the same cell to the same
location of the set $A$ at the same time, by giving rise to a race
condition\footnote{Actually, a thread-safe implementation could also
  be considered for $SciddicaT_{ac}$, becouse the active cells
  add/remove operations update 8 bit-long flag values in a working
  array transparently managed by OpenCAL-OMP, which are atomic
  operations. Unsefe operations are anyway here introduced for
  illustrative purposes.}. In order to avoid race condition issues,
OpenCAL-OMP is able to \emph{lock} memory locations involved in
concurrent non-atomic operations so that each thread can complete its
own task without the risk other threads interfere. In order to do
this, it is sufficient to place OpenCAL-OMP in \emph{unsafe} state, by
calling the \verb'calSetUnsafe2D()' function, as done at line
163. Unsafe functions are provided by the \verb'cal2DUnsafe.h'
OpenCAL-OMP header file (line 6). No other modifications to the serial
source code are required.

\subsection{SciddicaT with direct neighbors update}
The $SciddicaT_{ac+dnu}$ OpenCAL-OMP implementation of SciddicaT,
which takes advantage of both the active cells optimization and the
direct neighbors update, is shown in Listing
\ref{lst:calomp_SciddicaT-unsafe}. The corresponding serial
implementation can be found in Section \ref{sec:sciddicaT_extended}.

\lstinputlisting[label=lst:calomp_SciddicaT-unsafe, caption=An OpenCAL-OMP implementation of the $SciddicaT_{ac+dnu}$ debris flows XCA simulation model with direct neighbors update.]{../opencal-examples/OpenCAL-OMP/calomp_sciddicaT-unsafe/source/sciddicaT.c}

As for the serial implementation, only topographic altitude and debris
thickness are considered as substates (lines 25-28, 147-148), since
outflows substates are no longer needed. In fact, mass balance is here
obtained by adding outflows to the neighbors and subtracting them from
the central cell, as soon as they have been computed, in the $next$
computing plane, which therefore acts as an accumulation
layer. Moreover, only two elementary processes are now necessary
(cf. lines 143-144), instead of the three considered for the previous
versions of SciddicaT.

Besides the active cells optimization, even the direct neighbors
update requires unsafe opetations. For this purpose, the
\verb'cal2DUnsafe.h' header file is included at line 6, and the
\verb'calSetUnsafe2D()' function called at line 139 to places
OpenCAL-OMP in unsafe mode and avoid race condition issues. Besides
the already discussed \verb'calAddActiveCellX2D()' function, the
\verb'calAddNext2Dr()' and \verb'calAddNextX2Dr()' unsafe functions
are here employed (lines 88-89), in place of the combination of
get-set operations, as done in the corresponding serial implementation
(Listing \ref{lst:cal_sciddicaT-unsafe}, lines 84-85). In fact, let's
consider the source code snippet in Listing \ref{lst:get-set} (checked
out by Listing \ref{lst:cal_sciddicaT-unsafe}). As it can be seen, for
each not-eliminated cell, the algorithm computes a flow, $f$ (line 5)
and then subtracts it from the central cell (line 6), adding it to the
corresponding neighbour (line 7), in order to accomplish mass
balance. In both cases (flow subtraction and adding), a flavor of
\verb'calGet' function is called to read the current value of the
$Q_h$ substate from the next working plane. Subsequently, a flavor of
the \verb'calSet' function is used to update the previously read
value. When a single thread is used to perform such operations, no
race conditions can obviously occur. At the contrary, even in the case
of two concurrent threads, different undesirable situations can take
place, which give rise to a race condition and therefore to a wrong
result. For instance, let's suppose both threads read the value first,
and then write their updated values; in this case, the resulting value
will correspond to the one written by the thread that writes the value
for last, and the contribution of the other thread is lost. In order
to avoid such kind of problems when dealing with more threads, the
above mentioned \verb'calAddNext2Dr()' and \verb'calAddNextX2Dr()'
functions have to be used, since they lock the cell under
consideration and then perform the get-set operations without the risk
that other threads can interfere. In this way, no race conditions can
be triggered. Obviously, there is a side-effect in terms of
computational performance. In fact, as expected, locks can slow down
threads execution and therefore the entire simulation.

\begin{lstlisting}[float,floatplacement=H, label=lst:get-set, caption=Example of non atomic operation made of a combination of get-set calls.]
  // <snip>
  for (n=1; n<sciddicaT->sizeof_X; n++)
  if (!eliminated_cells[n])
  {
    f = (average-u[n])*P.r;
    calSet2Dr (sciddicaT,Q.h,i,j,  calGetNext2Dr (sciddicaT,Q.h,i,j)  -f);
    calSetX2Dr(sciddicaT,Q.h,i,j,n,calGetNextX2Dr(sciddicaT,Q.h,i,j,n)+f);
    // <snip>
  }
  // <snip>
\end{lstlisting}


\subsection{SciddicaT with explicit simulation loop}

As for the serial version, also for the OpenMP based release of
OpenCAL it is further possible to improve computational performances
of SciddicaT by avoiding unnecessary substates updating. As already
reported, the \verb'calRun2D()' function used so far to run the
simulation loop updates all the defined substates at the end of each
elementary process. However, in the specific case of the SciddicaT
model, no substates updating should be executed after the application
of the second elementary process, as this just removes inactive cells
from the set $A$.

Listing \ref{lst:calomp_sciddicaT-explicit} presents the
$SciddicaT_{ac+dnu+esl}$ OpenCAL-OMP implementation of SciddicaT,
based on an explicit global transition function, besides active cells
optimization and direct neighbors update. The explicit global
transition function is defined by means of
\verb'calRunAddGlobalTransitionFunc2D()'. It registers a callback
function within which it can be used to both reorder the sequence of
elementary processes to be applied in the generic computational step,
and also to perform only necessary substates updates. The
$SciddicaT_{ac+dnu+esl}$ implementation presented in Listing
\ref{lst:calomp_sciddicaT-explicit} also makes the simulation loop
explicit and defines a stopping criterion for the simulation
termination.

\lstinputlisting[label=lst:calomp_sciddicaT-explicit, caption=An
  OpenCAL-OMP implementation of the SciddicaT debris flows
  simulation model with explicit simulation
  loop.]{../opencal-examples/OpenCAL-OMP/calomp_sciddicaT-unsafe-explicit/source/sciddicaT.c}

\subsection{SciddicaT computational performance}

Table \ref{tab:speedup} resumes computational performance of all the
above illustrated SciddicaT implementations as implemented in
OpenCAL-OMP. The considered case of study is the simulation of the
Tessina landslide shown in Figure \ref{fig:sciddicaT}, which required
a total of 4000 computational steps. The adopted CPU is a Intel Core
i7-4702HQ @ 2.20GHz 4 cores (8 threads) processor, already considered
for the performance evaluation of the corresponding serial SciddicaT
implementations described in Chapter \ref{ch:opencal}. Results are
provided both in terms of elapsed time and speedup with respect to
the corresponding serial version. Elapsed times of the serial
simulations are also reported.

\begin{table}
  \centering
  \footnotesize
  \begin{tabular}{l|c|c|c|c|c|c}
    \hline
    Version & Serial [s] & 1thr & 2thr & 4thr & 6thr & 8thr\\
    \hline
    \hline
    $SciddicaT_{naive}$     & 240s & 0.82 (293s) & 1.22 (196s) & 1.53 (157s) & 1.64 (146s) & 1.6 (150s)\\
    $SciddicaT_{ac}$       & 23s  & 0.77 (30s)  & 1.36 (17s)  & 1.77 (13s)  & 2.09 (11s)  & 2.3 (10s)\\
    $SciddicaT_{ac+dnu}$    & 13s  & 0.77 (17s)  & 1.86 (7s)   & 2.6  (5s)   & 2.17  (6s)  & 2.6 (5s)\\
    $SciddicaT_{ac+dnu+esl}$ & 12s  & 0.75 (16s)  & 1.2  (10s)  & 2.4  (5s)   & 2.4  (5s)   & 3.0 (4s)\\
    \hline
  \end{tabular}
  \caption{Speedups and elapsed times (in brackets) of the four
    different OpenCAL-OMP implementations of the SciddicaT debris
    flows model. Elapsed times of the corresponding serial versions
    are reported in the second column for comparison.}
  \label{tab:speedup}
\end{table}

As noted, results are quite good. In particular, the better
results in terms of speed up were obtained for the fully optimized
SciddicaT implementation (i.e., with the explicit substate updating
feature), which runs 3 times faster than the corresponding serial
version when executed over 8 threads. Nevertheless, consider that the
SciddicaT simulation model here adopted is quite simple and better
performance in terms of speed up can certainly be obtained for CA
models with more complex transition functions and more extended
computational domains.

Eventually, note how progressive optimizations can considerably
reduce the overall execution time. In fact, if for the naive (i.e., non
optimized at all) serial implementation the elapsed time was 240s, for
the fully optimized parallel version the simulation lasted only 3
seconds, corresponding to a speed up value of 80, i.e. the fully
optimized parallel version runs 80 times faster than the serial naive
implementation.

%% \begin{table}
%%   \centering
%%   \begin{tabular}{l|c|c|c|c|c|c}
%%     \hline
%%     S3-hex version & Serial [s] & 1thr & 2thr & 4thr & 6thr & 8thr\\
%%     \hline
%%     \hline
%%     naive         & 1030s & 0.52 (1982s) & 0.9 (1142s) & 1.03 (998s) & 1.13 (913s) & 1.3  (781s)\\
%%     active cells  & 55s   & 0.86 (64s)   & 1.57 (35s)  & 2.75 (20s)  & 2.5  (22s)  & 3.06 (18s)\\
%%     eXtended      & 27s   & 0.87 (31s)   & 1.42 (19s)  & 2.7  (10s)  & 2.46 (11s)  & 3.38 (8s)\\
%%     explicit loop & 16s   & 0.8  (20s)   & 1.33 (12s)  & 2.67 (6s)  & 2.29  (7s)   & 3.2  (5s)\\
%%     \hline
%%   \end{tabular}
%%   \caption{Speedup of the four different
%%     implementations of the SciddicaS3hex debris flows model accelerated by OpenMP.}
%%   \label{tab:speedup}
%% \end{table}

%% \begin{table}
%%   \centering
%%   \begin{tabular}{l|c|c|c}
%%     \hline
%%     mbusu version & Serial [s] & OpenMP 6th & OpenCL (Quadro FX 1100M)\\
%%     \hline
%%     \hline
%%     naive         & 7796s & 3.57 (2185s) & 3.52 (2213s)\\
%%     \hline
%%   \end{tabular}
%%   \caption{mbusu Speedup.}
%%   \label{tab:speedup}
%% \end{table}

%% \begin{table}
%%   \centering
%%   \begin{tabular}{l|c|c}
%%     \hline
%%     Threads & Elapsed time [s] & Speedup\\
%%     \hline
%%     \hline
%%     1       & 7308             & 1     \\
%%     2       &                  &       \\
%%     4       &                  &       \\
%%     6       & 2185             & 3.34  \\
%%     8       &                  &       \\
%%     \hline
%%   \end{tabular}
%%   \caption{mbusu Speedup.}
%%   \label{tab:speedup}
%% \end{table}


\section{A three-dimensional example}
In Section \ref{sec:mod2}, we described the \emph{mod2} 3D CA and
shown a possible OpenCAL implementation. Here, we briefly present an
OpenCAL-OMP implementation of the same cellular automaton in Listing
\ref{lst:calomp_life}, by discussing the differences with respect the
corresponding serial implementation.

\lstinputlisting[float,floatplacement=H, label=lst:calomp_mod2,
  caption=An OpenCAL-OMP implementation of the mod2
  CA.]{../opencal-examples/OpenCAL-OMP/calomp_mod2CA3D/source/mod2CA3D.c}

As seen, the \emph{mod2} OpenCAL-OMP implementation is almost
identical to the serial one. As for the case of Game of Life, the only
differences can be found at lines 3-5 where OpenCAL-OMP headers can be
found instead of the OpenCAL onces. All the remaining source code is
unchanged.

\section{OpenCAL-GL and global functions}

As for OpenCAL, it is possible to exploit OpenCAL-GL to have a simple
visualization system by adding few lines of code to your
application. Combining OpenCAL-OMP and OpenCAL-GL does not differ from
what we have done in Section \ref{sec:combining_gl} for OpenCAL and
OpenCAL-GL. Therefore, please refer to this section for major details.

Similarly, you can also use the same global reduction functions
described in Section \ref{sec:redution} in OpenCAL-OMP, by considering
the OpenCAL-OMP \verb'cal2DReduction.h' and \verb'cal3DReduction.h'
specific header for 2D and 3D CA, respectively. Please refer to this
section for further details.