# Ping-Pong
In this section of the assignment you will be analyzing the latency and bandwidth of MPI communication. You will need to review your **cmake** and **queue** labs.

Begin by looking at the pingPong code discussed in class and shown below.

In [3]:
%%writefile src/pingpong.c
/*******************************************************************************
 FILE : pingpong.c
 
 LAST MODIFIED : $Date: 2006/02/24 21:56:11 $

 REVISION : $Revision: 1.1 $
 
 DESCRIPTION : MPI ping-pong program, illustrating sending and receiving of
   messages between two processors. 

 COPYRIGHT :

   The copyright in this unpublished software is the property of 
   Stanford University.

 AUTHORS : John Martin Bodley
           Stanford Center for Computational Earth and Environmental Science
           Stanford University
           Stanford, CA 94305      

==============================================================================*/  

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
/*******************************************************************************
 FUNCTION : main
  
 LAST MODIFIED : Feb 24, 2006
 
 DESCRIPTION : MPI send and receive of data between two processors.
 
==============================================================================*/
  
  char processor_name[MPI_MAX_PROCESSOR_NAME];
  int inmsg, outmsg, namelen, numprocs, rank, dest, source, size, tag = 1;
  double startwtime, endwtime, time;
  
  MPI_Status status;

  /* MPI initialization */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Get_processor_name(processor_name, &namelen);

  printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);

  if (rank == 0) { 
    dest = source = 1; 
    outmsg = 17;
  }
  else if (rank == 1) { 
    dest = source = 0; 
    outmsg = 19;
  }

  /* Start MPI timing */
  if (rank == 0) startwtime = MPI_Wtime();

  if (rank == 0) {
  
    MPI_Send(&outmsg, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
    MPI_Recv(&inmsg, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
  
  } else if (rank == 1) {
  
    MPI_Recv(&inmsg, 1, MPI_INT, source, tag, MPI_COMM_WORLD, &status);
    MPI_Send(&outmsg, 1, MPI_INT, dest, tag, MPI_COMM_WORLD);
  }

  /* Elapsed time */
  if (rank == 0) {
    endwtime = MPI_Wtime();
    time = 1.0e+6 * (endwtime - startwtime);
    size = sizeof(int);
    
    printf("%8d bytes took %9.0f usec (%8.3f MB/sec)\n", size, time, 
        2.0 * size / time);
  }

  MPI_Finalize();

} /* main */


Overwriting src/pingpong.c


This code sends a message back and forth between two MPI processes. You can configure and compile the code using the cell below.


In [4]:
!cd build; rm -rf *; cmake -DMPI_C_COMPILER=/usr/local/openmpi-1.6.2_gcc-5/bin/mpicc -DMPIEXEC_EXECUTABLE=/usr/local/openmpi-1.6.2_gcc-5/bin/mpiexec ../src; make

-- The C compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Found MPI_C: /usr/local/openmpi-1.6.2_gcc-5/lib/libmpi.so;/usr/lib64/libdl.so;/usr/lib64/libm.so;/usr/lib64/librt.so;/usr/lib64/libnsl.so;/usr/lib64/libutil.so;/usr/lib64/libm.so;/usr/lib64/libdl.so  
-- Configuring done
CMake Error at CMakeLists.txt:11 (add_executable):
  Target "pingPong" links to item " -Wl,--export-dynamic" which has leading
  or trailing whitespace.  This is now an error according to policy CMP0004.


-- Generating done
  Manually-specified variables were not used by the project:

    MPIEXEC_EXECUTABLE


-- Build files have been written to: /home/clapp/notebooks/mpi/build
Scanning dependencies of target pingPong
[ 50%] Building C object CMakeFiles/pingPong.di

To test the latency and bandwidth of the cluster you will need to start an MPI job on two different nodes (rather than on two different cores of the same machine). Use the work you did in the **queue** lab to set up a parallel pingpong job. The command you will need to run this example is "/usr/local/openmpi-1.6.2_gcc-5/bin/mpirun ./pingPong".

 Extend the Ping-Pong program to send messages of size $2^n$ where $n = 0,1, \ldots, 20$. To ensure accuracy of timings, loop over 1000 iterations for each $n$ and print the average time per message size and hop. How do bandwidth and latency depend on the message lengths? Make a plot of the results.
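
The bookkeeping for this experiment can be sketched as follows. This is a minimal, MPI-free sketch rather than a solution: the helper names `message_bytes` and `hop_stats` are hypothetical, and it assumes that the message size is counted in bytes and that the timed region (measured with `MPI_Wtime()`) covers `iters` complete round trips of two hops each.

```c
#include <stddef.h>

/* Hypothetical helper: message size in bytes for exponent n, i.e. 2^n. */
size_t message_bytes(int n) {
  return (size_t) 1 << n;
}

/* Hypothetical helper: convert the total wall clock time (in seconds) of
   `iters` round trips into the average time per hop (in microseconds) and
   the bandwidth (in MB/sec, with 1 MB = 1.0e6 bytes).
   One round trip consists of two hops. */
void hop_stats(size_t bytes, int iters, double total_sec,
               double *usec_per_hop, double *mb_per_sec) {
  *usec_per_hop = 1.0e+6 * total_sec / (2.0 * iters);

  /* Bytes per microsecond is numerically equal to MB/sec */
  *mb_per_sec = (double) bytes / *usec_per_hop;
}
```

Calling `hop_stats(message_bytes(n), 1000, elapsed, ...)` once for each $n$ would give the two columns to plot against message size. The bandwidth expression agrees with the `2.0 * size / time` used in pingpong.c, where `time` is the duration of one full round trip.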

Next create a new file pingpong2.cc. Add the correct rules to the CMakeLists.txt file below to compile and link the code. Repeat the measurements from the previous step, replacing the individual *MPI\_Send()* and *MPI\_Recv()* calls with a single *MPI\_Sendrecv()* call. What changes? Make a plot of the results.

In [14]:
%%writefile src/CMakeLists.txt
#TELL WHAT VERSION OF CMAKE WE REQUIRE
cmake_minimum_required(VERSION 2.8 FATAL_ERROR)

#Name our project and language we are going to use
project(mpi-test LANGUAGES C)


find_package(MPI REQUIRED)

#Here is the executable we want to build
add_executable(pingPong pingpong.c)

target_include_directories(pingPong PRIVATE ${MPI_C_INCLUDE_PATH} )
target_compile_options(pingPong PRIVATE ${MPI_C_COMPILE_FLAGS} -O3 )
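#Strip stray whitespace from the MPI link flags (avoids the CMP0004 error)
string(STRIP "${MPI_C_LINK_FLAGS}" MPI_C_LINK_FLAGS)
target_link_libraries(pingPong ${MPI_C_LIBRARIES} ${MPI_C_LINK_FLAGS} )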



Overwriting src/CMakeLists.txt


# Scalar Advection Equation

Consider the scalar advection equation
\begin{equation}
  \frac{\partial u}{\partial t} + a \frac{\partial u}{\partial x} = 0 \qquad 0 \le x \le 1
\end{equation}
with boundary condition $u(0, t) = 0$ and initial profile $u(x,0)= 1 - (10x - 1)^2$ for $0 \le x \le 0.2$ and $0$ for other values of $x$.
The equation describes the propagation of a scalar $u$. Assume that the fluid is moving with a constant velocity $a = 0.08$ in the $x$ direction. The exact solution is given by
$ u(x, t) =  1 - [10(x - at) - 1]^2$ for $0 \le x - at \le 0.2$ and $0$ at other values of $x$.
Discretizing using forward-time (explicit Euler) and backward-space differences for $a > 0$ gives
\begin{equation}
  \frac{u_j^{n + 1} - u_j^{n}}{\Delta t} + a \frac{u_j^n - u_{j - 1}^n}{\Delta x} + \mathcal{O}(\Delta x, \Delta t) = 0
\end{equation}
Dropping the truncation error and solving for $u_j^{n + 1}$ gives
\begin{eqnarray}
  u_j^{n + 1} &=& u_j^n - \frac{a \Delta t}{\Delta x}\left(u_{j}^n - u_{j - 1}^{n}\right) \nonumber \\
  &=& u_j^n - \gamma \left(u_{j}^n - u_{j - 1}^{n}\right)
  \label{eqn:discretization}
\end{eqnarray}
where $u_0^{n + 1} = 0$. In the above equation $\Delta x = 1/N$ where $N$ is the number of grid cells and
\begin{equation}
  \gamma = \frac{a\Delta t}{\Delta x}
\end{equation}
For stability the CFL condition requires that $\gamma \le 1$, which implies that the time step $\Delta t$ is bounded by
\begin{equation}
  \Delta t \le \frac{\Delta x}{a}
\end{equation}
Note that for the scalar advection equation, the numerical solution is exact for $\gamma = 1$. The numerical and exact solutions at $t = 0$ and $t = 8$ seconds are shown below.


![solution](http://sep.stanford.edu/sep/bob/teach/gp257/solution.png)

One-dimensional advection equation for forward-time backward-space (FTBS) for $t = 0, \, 8$ with $\Delta x = 0.02$, where - denotes the exact solution and o the numerical solution. Note that $\|u(x, t) - u^h(x, t)\| = 0$.
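
The reason the error vanishes for $\gamma = 1$ is that the FTBS update reduces to $u_j^{n + 1} = u_{j - 1}^n$: the profile moves exactly one cell per time step, which matches the exact transport distance $a \Delta t = \Delta x$ (up to floating-point rounding). The sketch below illustrates this with a single update step. The function name `ftbs_step` is hypothetical, and unlike the serial code in this assignment it updates the array in place instead of using a temporary array.

```c
/* One forward-time backward-space (FTBS) update of u[0..n], done in place.
   u[0] holds the boundary value u(0, t) and is left unchanged.
   Hypothetical helper for illustration only. */
void ftbs_step(int n, double gamma, double *u) {
  int i;

  /* Sweep from right to left so that u[i - 1] still holds the value
     from the old time level when u[i] is updated. */
  for (i = n; i >= 1; i--) {
    u[i] = u[i] - gamma * (u[i] - u[i - 1]);
  }
}
```

Applied with `gamma = 1.0` to the values `{0, 1, 2, 3, 0}`, one step gives `{0, 0, 1, 2, 3}`: every value has moved one cell to the right.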

Below you will find the serial code for solving the one-dimensional advection equation. You are required to parallelize the spatial domain using MPI; you only have to modify the *main* and *advection* routines. You will also need to add rules to compile the code to the CMakeLists.txt file found below. Begin by compiling and running the serial version of the code. Good values are T = 8 and a = 0.08. For the parallel problem use N = 1000000, but start with smaller values for the serial version and when testing.

In [13]:
%%writefile src/advection.c
/*******************************************************************************
 FILE : advection.c
 
 LAST MODIFIED : $Date: 2006/02/24 21:56:11 $

 REVISION : $Revision: 1.1 $
 
 DESCRIPTION : One-dimensional advection equation.

 COPYRIGHT :

   The copyright in this unpublished software is the property of 
   Stanford University.

 AUTHORS : John Martin Bodley
           Stanford Center for Computational Earth and Environmental Science
           Stanford University
           Stanford, CA 94305      

==============================================================================*/  

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
Variable Type Definitions
-------------------------
*/

typedef struct {
/*******************************************************************************
 STRUCTURE : delta_t
  
 LAST MODIFIED : Feb 24, 2005
 
 DESCRIPTION : Discretization structure.
 
==============================================================================*/

  double x, t;

} delta_t;

int advection(int n, double T, double *u, double a) {
/*******************************************************************************
 FUNCTION : advection
  
 LAST MODIFIED : Feb 24, 2006
 
 DESCRIPTION : Solves the one-dimensional advection equation using upwinding
   and explicit Euler discretization.
 
==============================================================================*/

  int i, s, t;
  double gamma, sum, *v, x;
  
  delta_t delta;

  /* Solution at next time step */
  v = (double *) malloc(sizeof(double) * (n + 1));

  /* Spatial discretization */
  delta.x = 1.0 / (double) n;

  /* Time step satisfying CFL condition */
  delta.t = delta.x / a;
  gamma = a * delta.t / delta.x;
  
  /* Boundary conditions u(0, t) */
  u[0] = 0;

  /* Initial conditions (t = 0). There are numerous ways to define the 
     initial solution (without the if statement), however this form is ideal 
     for the MPI implementation
  */
  
  /* Global x coordinate */
  x = delta.x;
  for (i = 1; i < n - 2; i++) {
    
    if (x >= 0.0 && x <= 0.2) {
      u[i] = 1 - pow(10.0 * x - 1.0, 2.0); 
    }
    x += delta.x;
  }
   
  /* Loop over time (indices) */
  s = (int) (T / delta.t);
  for (t = 0; t < s; t++) {
  
    /* Loop over space */
    for (i = 1; i < n; i++) {
      v[i] = u[i] - gamma * (u[i] - u[i - 1]); 
    }
    
    /* Update solution */
    for (i = 1; i < n; i++) {
      u[i] = v[i]; 
    }
  } 
  
  /* Exact solution */
  for (i = 0; i < n; i++) {
    v[i] = 0.0;
  }

  /* Global (x - at) coordinates */
  x = - a * T;
 
  for (i = 0; i <= n; i++) {
    if (x >= 0.0 && x <= 0.2) {
      v[i] = 1 - pow(10.0 * x - 1.0, 2.0); 
    }
    x += delta.x;
  }
  
  /* L-2 Error norm */
  sum = 0.0;
  
  for (i = 0; i < n; i++) {
    sum += pow(v[i] - u[i], 2.0);
  } 
  
  printf("residual error for n = %d is: %e\n", n, sqrt(sum)); 
  
  free(v);
  
  return(1);
   
} /* advection */

int get_args(int argc, char **argv, int *n, double *T, double *a, char **file) {
/*******************************************************************************
 FUNCTION : get_args
  
 LAST MODIFIED : Feb 24, 2006
 
 DESCRIPTION : Parses command line arguments.
 
==============================================================================*/

  int i;
  
  if (argc <= 1) {
  
    printf("Usage:\n\n"
      "%s [-n <number>] [-T <time>] [-a <velocity>] [-o <filename>]\n\n"
      "where:\n"
      "  -n <number>           Number of grid cells\n"
      "  -T <time>             Time interval 0 <= t <= T\n"
      "  -a <velocity>         Constant velocity\n"
      "  -o <filename>         Solution output filename\n",
      argv[0]); 
      return(0);
  }
  
  i = 1;
  
  while (i < argc) {
  
    /* Extract number of grid cells */
    if (!strcmp(argv[i], "-n")) {
      *n = atoi(argv[++i]);
    }
  
    /* Extract time span (0 <= t <= T) */
    if (!strcmp(argv[i], "-T")) {
      *T = atof(argv[++i]);
    }
    
    /* Extract the constant velocity  */
    if (!strcmp(argv[i], "-a")) {
      *a = atof(argv[++i]);
    }
    
    /* Extract the output filename (pointer) */
    if (!strcmp(argv[i], "-o")) {
      *file = (char *) malloc(sizeof(char) * (strlen(argv[i + 1]) + 1));  
      strcpy(*file, argv[++i]);
    }
    
    i++;
  }
 
  if (i != 9) {
  
    printf("-n, -T, -a and -o options must be specified\n");
    return(0);
  }
  
  return(1);
  
} /* get_args */

int main(int argc, char **argv) {
/*******************************************************************************
 FUNCTION : main
  
 LAST MODIFIED : Feb 24, 2005
 
 DESCRIPTION : One-dimensional advection equation.
 
==============================================================================*/

  char *file;
  int i, n;
  double *u, a, T;  
  
  FILE *fid;
  time_t startwtime, endwtime;
  
  /* Parse command line arguments */
  if (!get_args(argc, argv, &n, &T, &a, &file)) {
    return(0);
  }

  /* Allocate data arrays */
  u = (double *) calloc(n + 1, sizeof(double));
 
  startwtime = time(&startwtime);
  
  /* Solve the advection equation */
  advection(n, T, u, a); 
 
  endwtime = time(&endwtime);
  printf("wall clock time = %f\n", difftime(endwtime, startwtime));
 
  /* Output solution to file */
  if ((fid = fopen(file, "w")) != NULL) {
    for (i = 0; i <= n; i++) {
      fprintf(fid, "%lf\n", u[i]);
    }
    fclose(fid);
  }
 
  /* Clean up */
  free(u);

  return(1);

} /* main */


Overwriting src/advection.c


## MPI Version
Create a parallel version named src/advectionP.c. 
Once you've got your parallel version create a parallel PBS job and run the code using 2-32 instances (you can choose how many cores per node).

Split the spatial domain into $p$ sub-domains (domain decomposition), one per processor, to solve the one-dimensional scalar advection equation on $p$ processors. Referring to equation \ref{eqn:discretization}, the solution of the FTBS discretization is of the form $u^{n + 1}_j = f(u_{j - 1}^n, u_j^n)$. Hence at each time step processor $i$ only has to communicate with processor $i + 1$, which requires one ghost cell per sub-domain. The use of *MPI\_PROC\_NULL* simplifies the code for the boundary regions defined on processors 1 and $p$ for $u(0, t)$ and $u(1, t)$ respectively.
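
The index arithmetic behind such a decomposition can be sketched as below. This is an MPI-free illustration under stated assumptions: `subdomain_t`, `decompose` and `NO_NEIGHBOR` are hypothetical names, ranks are numbered from 0 as in MPI (so the text's processors 1 and $p$ correspond to ranks 0 and $p - 1$), and `NO_NEIGHBOR` merely stands in for *MPI\_PROC\_NULL*.

```c
/* Stand-in for MPI_PROC_NULL in this MPI-free sketch (assumption). */
#define NO_NEIGHBOR (-1)

/* Hypothetical description of the cells owned by one rank. */
typedef struct {
  int first;  /* global index of the first owned cell */
  int count;  /* number of owned cells */
  int left;   /* rank that supplies the ghost cell, or NO_NEIGHBOR */
  int right;  /* rank that receives our last cell, or NO_NEIGHBOR */
} subdomain_t;

/* Split the cells 1..n as evenly as possible over p ranks (0-based). */
subdomain_t decompose(int n, int p, int rank) {
  subdomain_t s;
  int base = n / p, extra = n % p;

  /* The first `extra` ranks own one additional cell */
  s.count = base + (rank < extra ? 1 : 0);
  s.first = 1 + rank * base + (rank < extra ? rank : extra);

  /* Neighbours along the x direction */
  s.left = (rank == 0) ? NO_NEIGHBOR : rank - 1;
  s.right = (rank == p - 1) ? NO_NEIGHBOR : rank + 1;

  return s;
}
```

In an MPI version, `left` and `right` would be passed as the source and destination of the ghost cell exchange. Because communication with *MPI\_PROC\_NULL* completes immediately without transferring data, the first and last ranks can then run the same exchange code as the interior ranks.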

It is strongly advised that you test your MPI implementation using a smaller number of grid cells (-n), ensuring that the output is in agreement with the serial solution. Furthermore, for $\gamma = 1$ the error norm is $\|u(x, t) - u^h(x, t)\| = 0$, where $u^h(x, t)$ denotes the numerical solution. Verify your MPI solution initially for one 


## Parallel Performance Analysis
In the ideal case, if the computation is divided into $p$ equal parts, the total execution time will be nearly $1/p$ of the time required by a single processor. Let $t_j$ denote the wall clock time required to execute a task with $j$ processors; then the speedup $S_p$ for $p$ processors is defined as
\begin{equation}
  S_p = \frac{t_1}{t_p}
\end{equation}
where $t_1$ is the time required for the most efficient sequential algorithm. Note that this is not the same as running the MPI code using one processor. Furthermore, the computational efficiency using $p$ processors is defined as
\begin{equation}
  E_p = \frac{S_p}{p}
\end{equation}
where $0 \le E_p \le 1$.
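
As a small sketch of how these two definitions translate into code, the helpers below compute $S_p$ and $E_p$ from measured wall clock times. The function names are hypothetical, and any timings passed in are whatever you measure yourself.

```c
/* Speedup S_p = t_1 / t_p, where t1 is the serial wall clock time and
   tp the wall clock time on p processors. Hypothetical helper. */
double speedup(double t1, double tp) {
  return t1 / tp;
}

/* Efficiency E_p = S_p / p. Hypothetical helper. */
double efficiency(double t1, double tp, int p) {
  return speedup(t1, tp) / (double) p;
}
```

For example, with made-up timings of 120 seconds for the serial code and 20 seconds on 8 processors, `speedup(120.0, 20.0)` is 6 and `efficiency(120.0, 20.0, 8)` is 0.75.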


Report and plot your performance as a function of number of cores.