## <center>MPIIO - Distributed File Systems</center>
### <center> Linh B. Ngo </center>
### <center> CPSC 3620 </center>

#### <center> Common Ways of Doing I/O in Parallel Programs </center>

**All processes send data to rank 0, and 0 writes it to the file**

<center> <img src="pictures/mpiio1.png" width="700"/> 
</center>

- Pros:
    - parallel machine may support I/O from only one process (e.g., no common file system)
    - some I/O libraries (e.g. HDF-4, NetCDF) not parallel
    - resulting single file is handy for ftp, mv
    - big blocks improve performance
    - requires few changes to the original serial code
- Cons:
    - lack of parallelism limits scalability, performance (single node bottleneck)


**Each process writes to a separate file**

<center> <img src="pictures/mpiio2.png" width="700"/> 
</center>

- Pros: 
    - parallelism, high performance
- Cons:  
    - lots of small files to manage
    - difficult to read the data back with a different number of processes


#### <center>MPI-IO Approach</center>

**What is Parallel I/O?**

Multiple processes of a parallel program accessing data (reading or writing) from a common file.

<center> <img src="pictures/mpiio3.png" width="700"/> 
</center>

**Why Parallel I/O?**

- Non-parallel I/O is simple but
    - Poor performance (single process writes to one file) or
    - Awkward and not interoperable with other tools (each process writes a separate file)
- Parallel I/O
    - Provides high performance
    - Can provide a single file that can be used with other tools (such as visualization programs)


**Why is MPI a good setting for Parallel I/O?**

- Writing is like sending a message and reading is like receiving
- Any parallel I/O system will need a mechanism to
    - define collective operations (MPI communicators)
    - define noncontiguous data layout in memory and file (MPI datatypes)
    - test completion of nonblocking operations (MPI request objects)
- i.e., lots of MPI-like machinery


- Working with a file through MPI-IO takes four stages
    * Open File
    * Set File View (optional)
    * Read or Write Data
    * Close File
- All the complexity is hidden in setting the file view
- Writing is probably more important in practice than reading


**Opening a File (C Syntax)**

~~~
int MPI_File_open(MPI_Comm comm, const char *filename, int amode, MPI_Info info, MPI_File *fh)
~~~

- comm Communicator (handle)
- filename Name of the file to open (string)
- amode File access mode (integer)
- info Info object (handle)
- fh New file handle (handle)

- MPI_File_open opens the file identified by the filename filename on all processes in the comm communicator group. 
- MPI_File_open is a collective routine; all processes must provide the same value for amode, and all processes must provide filenames that reference the same file and which are textually identical. A process can open a file independently of other processes by using the MPI_COMM_SELF communicator. 
- The file handle returned, fh, can be subsequently used to access the file until the file is closed using MPI_File_close. Before calling MPI_Finalize, the user is required to close (via MPI_File_close) all files that were opened with MPI_File_open. 
- Initially, all processes view the file as a linear byte stream; that is, the etype and filetype are both MPI_BYTE. The file view can be changed via the MPI_File_set_view routine.

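The access mode amode is formed by combining the following flags with a bitwise OR:

- MPI_MODE_APPEND -- Set the initial position of all file pointers to the end of the file.
- MPI_MODE_CREATE -- Create the file if it does not exist.
- MPI_MODE_DELETE_ON_CLOSE -- Delete the file when it is closed.
- MPI_MODE_EXCL -- Raise an error if the file to be created already exists.
- MPI_MODE_RDONLY -- Read only.
- MPI_MODE_RDWR -- Reading and writing.
- MPI_MODE_SEQUENTIAL -- The file will be accessed only sequentially.
- MPI_MODE_WRONLY -- Write only.
- MPI_MODE_UNIQUE_OPEN -- The file will not be opened concurrently elsewhere.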

The modes MPI_MODE_RDONLY, MPI_MODE_RDWR, MPI_MODE_WRONLY, and MPI_MODE_CREATE have identical semantics to their POSIX counterparts. It is erroneous to specify MPI_MODE_CREATE in conjunction with MPI_MODE_RDONLY. Errors related to the access mode are raised in the class MPI_ERR_AMODE.
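
The rules above can be sketched as a small validity check. This is an illustration only, not MPI code: the `AMode` class and its numeric values are hypothetical stand-ins for the MPI constants, and the requirement that exactly one of the three access flags be present is taken from the MPI standard rather than from the text above.

```python
from enum import IntFlag


class AMode(IntFlag):
    """Stand-in flags; the numeric values are hypothetical, not those of any MPI library."""
    RDONLY = 1
    RDWR = 2
    WRONLY = 4
    CREATE = 8


def check_amode(amode):
    """Return True if the flag combination is legal under the rules sketched here."""
    access = [AMode.RDONLY, AMode.RDWR, AMode.WRONLY]
    # Exactly one of the three access flags must be present.
    if sum(1 for flag in access if amode & flag) != 1:
        return False
    # Creating a file that can only be read is erroneous.
    if amode & AMode.RDONLY and amode & AMode.CREATE:
        return False
    return True


print(check_amode(AMode.WRONLY | AMode.CREATE))  # True
print(check_amode(AMode.RDONLY | AMode.CREATE))  # False
print(check_amode(AMode.RDONLY | AMode.WRONLY))  # False
```

A real MPI library performs this kind of check itself and reports violations in the error class MPI_ERR_AMODE.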


**Set File View (C Syntax)**
~~~
int MPI_File_set_view(MPI_File fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, const char *datarep, MPI_Info info)
~~~

- disp Displacement (integer).
- etype Elementary data type (handle).
- filetype File type (handle).
- datarep Data representation (string).
- info Info object (handle).

The MPI_File_set_view routine changes the process's view of the data in the file:

- The beginning of the data accessible in the file through that view is set to disp. The disp argument is an absolute offset in bytes from the beginning of the file.
- The type of data is set to etype, and the distribution of data to processes is set to filetype.
- The independent file pointers and the shared file pointer are reset to zero.
- The call is collective across fh: all processes in the group must pass identical values for datarep and provide an etype with an identical extent.
- The values for disp, filetype, and info may vary from process to process.
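
To make the roles of disp, etype, and filetype concrete, the following sketch computes which byte offsets of the file a process can see through a strided view. It is plain Python rather than MPI; the helper `view_offsets`, the 4-byte integer size, and the choice of four processes are assumptions made for illustration.

```python
def view_offsets(disp, etype_size, count, blocklength, stride):
    """Byte offsets of the etypes visible through one tile of a vector-like filetype.

    disp        -- byte displacement where the view starts
    etype_size  -- size of one elementary item in bytes
    count       -- number of blocks
    blocklength -- etypes per block
    stride      -- distance between block starts, in etypes
    """
    offsets = []
    for block in range(count):
        for item in range(blocklength):
            offsets.append(disp + (block * stride + item) * etype_size)
    return offsets


INT_SIZE = 4   # assumed size of a C int in bytes
NPROCS = 4     # assumed number of processes

for rank in range(NPROCS):
    # Each rank shifts its view by one integer, then takes one integer out of every NPROCS.
    print(rank, view_offsets(disp=rank * INT_SIZE, etype_size=INT_SIZE,
                             count=3, blocklength=1, stride=NPROCS))
# 0 [0, 16, 32]
# 1 [4, 20, 36]
# 2 [8, 24, 40]
# 3 [12, 28, 44]
```

The four views are disjoint and together cover every integer slot, which is why processes with different disp values can write through the same filetype without overwriting one another.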

**Reading a File (C Syntax)**
~~~
int MPI_File_read(MPI_File fh, void *buf,  int count, MPI_Datatype datatype, MPI_Status *status)
~~~

- fh File handle (handle).
- count Number of elements in buffer (integer).
- datatype Data type of each buffer element (handle).
- buf Initial address of buffer (choice).
- status Status object (status).


**Seeking in a File (C Syntax)**
~~~
int MPI_File_seek(MPI_File fh, MPI_Offset offset, int whence)
~~~

- fh File handle (handle).
- offset File offset (integer).
- whence Update mode (integer).

MPI_File_seek updates the individual file pointer according to whence, which can take one of the following values:

- MPI_SEEK_SET - The pointer is set to offset. 
- MPI_SEEK_CUR - The pointer is set to the current pointer position plus offset. 
- MPI_SEEK_END - The pointer is set to the end of the file plus offset.

The offset can be negative, which allows seeking backwards. It is erroneous to seek to a negative position in the file. The end of the file is defined to be the location of the next elementary data item immediately after the last accessed data item, even if that location is a hole.
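
The three update modes can be mimicked in a few lines of plain Python. This is a sketch of the semantics only: the function `seek` and the constants below are stand-ins, not the MPI routine, and positions are counted in elementary items relative to the current view.

```python
SEEK_SET, SEEK_CUR, SEEK_END = 0, 1, 2  # stand-ins for MPI_SEEK_SET, MPI_SEEK_CUR, MPI_SEEK_END


def seek(current, offset, whence, end):
    """Return the new pointer position, mimicking the rules listed above."""
    if whence == SEEK_SET:
        new_pos = offset
    elif whence == SEEK_CUR:
        new_pos = current + offset
    elif whence == SEEK_END:
        new_pos = end + offset
    else:
        raise ValueError("unknown whence value")
    if new_pos < 0:
        # Seeking to a negative position is erroneous.
        raise ValueError("negative file position")
    return new_pos


pos = seek(current=0, offset=10, whence=SEEK_SET, end=40)    # 10
pos = seek(current=pos, offset=-4, whence=SEEK_CUR, end=40)  # 6
pos = seek(current=pos, offset=-1, whence=SEEK_END, end=40)  # 39
print(pos)
```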


**Closing a File (C Syntax)**
~~~
int MPI_File_close(MPI_File *fh)
~~~


In [1]:
%%writefile codes/mpi4py/mpiio_seqwrite.py
#!/usr/bin/env python
from mpi4py import MPI
import numpy as np
    
comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Open the shared file collectively for writing; create it if it does not exist
amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./datafile.contig", amode)

# Each process fills a buffer of ten 64-bit integers with its own rank
buffer = np.empty(10, dtype=np.int64)
buffer[:] = rank
print(buffer)

# Each process writes its block at a distinct byte offset
offset = rank*buffer.nbytes
fh.Write_at_all(offset, buffer)
fh.Close()

if rank == 0:
    print(np.fromfile("./datafile.contig", dtype=np.int64))

Overwriting codes/mpi4py/mpiio_seqwrite.py


In [2]:
!chmod 755 codes/mpi4py/mpiio_seqwrite.py
!module load gcc/5.3.0 openmpi/1.10.3;mpirun -np 4 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_seqwrite.py

[1 1 1 1 1 1 1 1 1 1]
[2 2 2 2 2 2 2 2 2 2]
[3 3 3 3 3 3 3 3 3 3]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
 3 3 3]
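
The offset arithmetic behind this output can be checked without MPI. The sketch below is a serial emulation under stated assumptions: one Python process plays every rank in turn, and ordinary `seek` and `write` calls stand in for the explicit-offset write. The helper names and the temporary file are hypothetical.

```python
import os
import tempfile
from array import array

NPROCS = 4        # assumed number of processes, matching the run above
ITEM_COUNT = 10   # items per process, as in the script above


def emulate_contiguous_write(path, nprocs, item_count):
    """Write each simulated rank's block at offset rank * nbytes, one rank at a time."""
    with open(path, "wb") as fh:
        # Visit ranks in reverse to show that the order of the writes does not matter.
        for rank in reversed(range(nprocs)):
            block = array("q", [rank] * item_count)   # 'q' = signed 64-bit integers
            nbytes = len(block) * block.itemsize
            fh.seek(rank * nbytes)
            fh.write(block.tobytes())


def read_back(path):
    """Read the whole file back as 64-bit integers."""
    values = array("q")
    with open(path, "rb") as fh:
        values.frombytes(fh.read())
    return values.tolist()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "datafile.contig")
    emulate_contiguous_write(path, NPROCS, ITEM_COUNT)
    result = read_back(path)

print(result)
```

Because every block lands at an offset determined by the rank alone, the file contents do not depend on which process writes first.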


In [3]:
%%writefile codes/mpi4py/mpiio_circwrite.py
#!/usr/bin/env python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
fh = MPI.File.Open(comm, "./datafile.noncontig", amode)

item_count = 10
buffer = np.empty(item_count, dtype='i')
buffer[:] = rank
print(buffer)

# File type: item_count blocks of one integer each, with block starts size integers apart
filetype = MPI.INT.Create_vector(item_count, 1, size)
filetype.Commit()

# Shift each process's view by rank integers so that the views interleave
displacement = MPI.INT.Get_size()*rank
fh.Set_view(displacement, filetype=filetype)

fh.Write_all(buffer)
filetype.Free()
fh.Close()

if rank == 0:
    print(np.fromfile("./datafile.noncontig", dtype='i'))

Overwriting codes/mpi4py/mpiio_circwrite.py


In [4]:
!chmod 755 codes/mpi4py/mpiio_circwrite.py
!module load gcc/5.3.0 openmpi/1.10.3;mpirun -np 4 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_circwrite.py

[1 1 1 1 1 1 1 1 1 1]
[2 2 2 2 2 2 2 2 2 2]
[3 3 3 3 3 3 3 3 3 3]
[0 0 0 0 0 0 0 0 0 0]
[0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0
 1 2 3]


In [3]:
%%writefile codes/mpi4py/mpiio_bigwrite.py
#!/usr/bin/env python
from mpi4py import MPI
import numpy as np
    
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
fh = MPI.File.Open(comm, "/scratch1/lngo/datafile.contig", amode)

# Split 1.6 billion 64-bit integers evenly across the processes
local_count = 1600000000 // size

buffer = np.empty(local_count, dtype=np.int64)
buffer[:] = rank

offset = rank*buffer.nbytes
fh.Write_at_all(offset, buffer)
fh.Close()

Overwriting codes/mpi4py/mpiio_bigwrite.py


In [4]:
!chmod 755 codes/mpi4py/mpiio_bigwrite.py
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 8 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite.py
!ls -lh /scratch1/lngo/datafile.contig; rm /scratch1/lngo/datafile.contig
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 16 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite.py
!ls -lh /scratch1/lngo/datafile.contig; rm /scratch1/lngo/datafile.contig
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 32 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite.py
!ls -lh /scratch1/lngo/datafile.contig; rm /scratch1/lngo/datafile.contig


real	1m36.914s
user	0m4.480s
sys	0m25.080s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:35 /scratch1/lngo/datafile.contig

real	0m13.298s
user	0m15.483s
sys	0m28.579s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:40 /scratch1/lngo/datafile.contig

real	0m9.252s
user	0m21.329s
sys	0m14.430s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:40 /scratch1/lngo/datafile.contig


In [5]:
%%writefile codes/mpi4py/mpiio_bigwrite_2.py
#!/usr/bin/env python
from mpi4py import MPI
import numpy as np
    
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
amode = MPI.MODE_WRONLY|MPI.MODE_CREATE
fh = MPI.File.Open(comm, "/home/lngo/datafile.contig", amode)

# Split 1.6 billion 64-bit integers evenly across the processes
local_count = 1600000000 // size

buffer = np.empty(local_count, dtype=np.int64)
buffer[:] = rank

offset = rank*buffer.nbytes
fh.Write_at_all(offset, buffer)
fh.Close()

Overwriting codes/mpi4py/mpiio_bigwrite_2.py


In [6]:
!chmod 755 codes/mpi4py/mpiio_bigwrite_2.py
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 8 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite_2.py
!ls -lh /home/lngo/datafile.contig; rm /home/lngo/datafile.contig
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 16 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite_2.py
!ls -lh /home/lngo/datafile.contig; rm /home/lngo/datafile.contig
!module load gcc/5.3.0 openmpi/1.10.3;time mpirun -np 32 --mca mpi_cuda_support 0 codes/mpi4py/mpiio_bigwrite_2.py
!ls -lh /home/lngo/datafile.contig; rm /home/lngo/datafile.contig


real	0m27.454s
user	0m2.941s
sys	0m31.769s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:41 /home/lngo/datafile.contig

real	0m29.080s
user	0m58.651s
sys	0m24.621s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:41 /home/lngo/datafile.contig

real	0m28.408s
user	1m45.810s
sys	0m17.878s
-rw-r--r-- 1 lngo bigdata 12G Oct 23 11:42 /home/lngo/datafile.contig
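
The two sets of timings are easier to compare as aggregate write bandwidth. The sketch below does that arithmetic, assuming each integer occupies 8 bytes (consistent with the 12G file size listed above) and using the `real` times copied from the output. The helper `bandwidth_mb_per_s` is introduced here for illustration.

```python
ITEMS = 1_600_000_000   # total integers written, from the scripts above
ITEM_BYTES = 8          # assumption: the integers are 64-bit on this system
TOTAL_BYTES = ITEMS * ITEM_BYTES

# Wall-clock ("real") seconds copied from the timing output above.
timings = {
    "/scratch1": {8: 96.914, 16: 13.298, 32: 9.252},
    "/home": {8: 27.454, 16: 29.080, 32: 28.408},
}


def bandwidth_mb_per_s(total_bytes, seconds):
    """Aggregate bandwidth in megabytes (10**6 bytes) per second."""
    return total_bytes / seconds / 1e6


for filesystem, runs in timings.items():
    for nprocs, seconds in runs.items():
        rate = bandwidth_mb_per_s(TOTAL_BYTES, seconds)
        print(f"{filesystem:10s} np={nprocs:2d}  {rate:8.1f} MB/s")
```

Under these assumptions the scratch file system goes from roughly 130 MB/s with 8 processes to roughly 1.4 GB/s with 32, while the home directory stays near 450 MB/s regardless of the process count. Because `real` also includes job startup and the time to fill the buffers, these figures are lower bounds on the actual write rate.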


#### <center> Under the Covers of MPI-IO </center>

- With a collective call on a file view, the MPI-IO implementation is given a lot of information:
    - Collection of processes reading data
    - Structured description of the regions
- Implementation has some options for how to obtain this data
    - Noncontiguous data access optimizations
    - Collective I/O optimizations


#### <center> General Guidelines for Achieving High I/O Performance </center>

- Buy sufficient I/O hardware for the machine
- Use fast file systems, not NFS-mounted home directories
- Do not perform I/O from one process only
- Make large requests wherever possible
- For noncontiguous requests, use derived datatypes and a single collective I/O call
