EDAMER: Exascale Data Analysis Methods with Enhanced Reusability

Python library edamer feat. distributed algorithms powered by hbrs-mpl and Elemental

edamer is a Python 3 library (GitHub.com, H-BRS GitLab, FHG GitLab) that provides data structures and algorithms like PCA for distributed scientific computing on HPC clusters. Our research goal is to make the full power of generic C++ libraries usable for data analysis and machine learning applications in Python without sacrificing space and time efficiency. For example, edamer makes it possible to run a distributed PCA on CFD data from NEK5000 in situ, from a ParaView Catalyst script written in Python only.

Its development started in July 2020 as part of the EDAMER research project and is funded by Fraunhofer SCAI. EDAMER is an acronym for Exascale Data Analysis Methods with Enhanced Reusability and expresses our ambition to bring our efficient and scalable software components for dense linear algebra and dimension reduction from the C++ library hbrs-mpl to the broader audience of Python developers. We want to equip scientists with a generic framework and a predefined set of reusable and robust components that allows them to codify complex algorithms for exascale computers rapidly, without the need for in-depth programming knowledge in C++ or HPC. Numerical analysis of large simulation datasets is a representative domain for generic libraries, as quick and easy comparison of varied algorithms is of particular interest here.

⚠️ WARNING: This code is still under development and is not ready for production yet. Code might change or break at any time, and proper documentation is still missing. ⚠️

⚠️ DEPRECATION NOTICE: Apparently, this code has met the inevitable fate of many state-funded research projects: it has not been actively worked on since early 2021. Software is made by teams of people. If you want people to continue developing a project after it ceases to be their personal interest, fund them for it. ⚠️

Example: Distributed PCA in 12 lines of Python code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (c) 2020 Jakob Meng, <jakobmeng@web.de>

from edamer import detail, dt, fn
from mpi4py import MPI
import numpy as np

# Generate example data
dataset = np.asarray(np.arange(1000*2000).reshape(1000, 2000), order='F')

# Wrap NumPy array in Elemental matrix (no copy)
matrix = dt.ElMatrix.view_from_numpy(dataset)

# Construct distributed Elemental matrix from local matrices (no copy)
grid = dt.ElGrid(MPI.COMM_WORLD)
star_star = dt.MatrixDistribution.make(dt.ElDist.STAR, dt.ElDist.STAR, dt.ElDistWrap.ELEMENT)
dist_matrix = dt.ElDistMatrix.make_view(grid, matrix, star_star)

# Apply PCA on distributed Elemental matrix
pca_ctrl = dt.PcaControl.make(economy=True, center=False, normalize=False)
dec = fn.pca(dist_matrix, pca_ctrl)

# Rebuild and test dataset
rebuild = fn.multiply(dec.score, fn.transpose(dec.coeff))
assert detail.test.matrix_matrix_allclose(dataset, rebuild)

Under the hood

edamer's functions are mostly geared towards compatibility with MATLAB's API, because the latter has a strong focus on mathematical notation, is properly documented, and is useful for rapid prototyping.

edamer builds heavily upon the C++ libraries hbrs-mpl and Elemental, which provide HPC-ready data structures and algorithms for linear algebra and dimension reduction.

The full tech stack consists of the C++ libraries hbrs-mpl and Elemental, pybind11 and Boost.Hana for the Python/C++ interop, and NumPy and mpi4py on the Python side.

Status Quo

So far, Elemental's data structures for non-distributed and distributed matrices, and the corresponding wrappers from hbrs-mpl, have been integrated with NumPy. For example, 2d NumPy arrays (matrices) can be converted into non-distributed Elemental matrices and vice versa, without having to copy any matrix entry. Further, Elemental's MPI interface has been integrated with mpi4py. This allows, for example, defining the MPI computation grid with mpi4py and then handing it over to Elemental. This conversion is done implicitly, i.e. a custom adapter takes care of converting mpi4py communicators into MPI handles (as defined by the official MPI C API) that can be consumed by Elemental, and vice versa. Distributed Elemental matrices can be constructed from an MPI computation grid and local matrices. All of Elemental's matrix distributions are available in edamer, e.g. [MC,MR], [STAR,STAR] and [VC,STAR]. A minimal sketch of this interop is shown below.
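
The following is a minimal sketch of this interop, using only the API that already appears in the PCA example above; duplicating the communicator is merely an illustration of passing an arbitrary mpi4py communicator to Elemental.

# Minimal interop sketch (assumes only the API shown in the PCA example above).
from edamer import dt
from mpi4py import MPI
import numpy as np

# Any mpi4py communicator can be handed over; the adapter converts it into an
# MPI handle that Elemental can consume.
comm = MPI.COMM_WORLD.Dup()
grid = dt.ElGrid(comm)

# Wrap a column-major (Fortran-order) NumPy array in an Elemental matrix (no copy) ...
local = dt.ElMatrix.view_from_numpy(np.asarray(np.eye(3), order='F'))

# ... and construct a distributed [STAR,STAR] matrix from the grid and the local matrix.
star_star = dt.MatrixDistribution.make(dt.ElDist.STAR, dt.ElDist.STAR, dt.ElDistWrap.ELEMENT)
dist = dt.ElDistMatrix.make_view(grid, local, star_star)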

The functionality for the Python and C++ interop is heavily based on pybind11 and Boost.Hana.

How to build, install and run code using Docker or Podman

For a quick and easy start into developing with Python and C++, a set of ready-to-use Docker/Podman images, jm1337/debian-dev-hbrs and jm1337/debian-dev-full (the latter supports more languages), has been created. They contain a full development system including all tools and libraries necessary to hack on distributed decomposition algorithms and more (Docker Hub, source files for the Docker images).

Sidenote:

Creating the Docker images was tedious, especially because bugs (#959387, #972551) in Debian's ParaView package (which affected Ubuntu and derivatives as well) had to be fixed or worked around, i.e. Debian did not package the development files necessary to use e.g. ParaView Catalyst. But by now, most of the proposed patches have been incorporated into Debian.

Install Docker or Podman

  • On Debian 10 (Buster) or Debian 11 (Bullseye) just run sudo apt install docker.io or follow the official install guide for Docker Engine on Debian
  • On Ubuntu 18.04 LTS (Bionic Beaver) and Ubuntu 20.04 LTS (Focal Fossa) just run sudo apt install docker.io (from bionic/universe and focal/universe repositories) or follow the official install guide for Docker Engine on Ubuntu
  • On Windows 10 follow the official install guide for Docker Desktop on Windows
  • On Mac follow the official install guide for Docker Desktop on Mac
  • On Fedora, Red Hat Enterprise Linux (RHEL) and CentOS follow the official install guide for Podman

Set up and run container

# docker version 18.06.0-ce or later is recommended
docker --version

# fetch docker image
docker pull jm1337/debian-dev-hbrs:bullseye

# log into docker container
docker run -ti jm1337/debian-dev-hbrs:bullseye
# or using a persistent home directory, e.g.
docker run -ti -v /HOST_DIR:/home/devil/ jm1337/debian-dev-hbrs:bullseye
# or using a persistent home directory on Windows hosts, e.g.
docker run -ti -v C:\YOUR_DIR:/home/devil/ jm1337/debian-dev-hbrs:bullseye

Podman strives for complete CLI compatibility with Docker, hence you may use the alias command to create a docker alias for Podman:

alias docker=podman

Build and run code inside container

Execute the following commands within the Docker/Podman container:

# choose a compiler
export CC=clang-10
export CXX=clang++-10
# or
export CC=gcc-10
export CXX=g++-10

# fetch, compile and install prerequisites
git clone --depth 1 https://github.com/JM1/hbrs-cmake.git
cd hbrs-cmake
mkdir build && cd build/
# install to non-system directory because sudo is not allowed in this docker container
cmake \
    -DCMAKE_INSTALL_PREFIX=$HOME/.local \
    ..
make -j$(nproc)
make install
cd ../../

git clone --depth 1 https://github.com/JM1/hbrs-mpl.git
cd hbrs-mpl
mkdir build && cd build/
cmake \
 -DCMAKE_INSTALL_PREFIX=$HOME/.local \
 -DHBRS_MPL_ENABLE_ELEMENTAL=ON \
 -DHBRS_MPL_ENABLE_MATLAB=OFF \
 -DHBRS_MPL_ENABLE_TESTS=OFF \
 -DHBRS_MPL_ENABLE_BENCHMARKS=OFF \
 ..
make -j$(nproc)
make install
cd ../../

# fetch, compile and install python-edamer
git clone --depth 1 https://github.com/JM1/python-edamer.git
cd python-edamer
mkdir build && cd build/
cmake \
    -DCMAKE_Python3_COMPILER_FORCED=ON \
    -DCMAKE_INSTALL_PREFIX=$HOME/.local \
    -DEDAMER_ENABLE_SCALAR_INT=ON \
    -DEDAMER_ENABLE_SCALAR_FLOAT=ON \
    -DEDAMER_ENABLE_SCALAR_DOUBLE=ON \
    -DEDAMER_ENABLE_SCALAR_COMPLEX_FLOAT=ON \
    -DEDAMER_ENABLE_SCALAR_COMPLEX_DOUBLE=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_CIRC_CIRC=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_MC_MR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_MC_STAR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_MD_STAR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_MR_MC=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_MR_STAR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_MC=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_MD=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_MR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_STAR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_VC=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_STAR_VR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_VC_STAR=ON \
    -DEDAMER_ENABLE_MATRIX_DISTRIBUTION_VR_STAR=ON \
    -DEDAMER_ENABLE_TESTS=ON \
    -DMPIEXEC_MAX_NUMPROCS=2 \
    ..
make -j$(nproc)
# If unit test dt_el_dist_matrix fails in function test_copy_redist due to zeros
# in the upper matrix indices, then try to use a different MPI point-to-point
# management layer, e.g. ob1 instead of ucx.
#export OMPI_MCA_pml=ob1
ctest --verbose --output-on-failure
make install

export LD_LIBRARY_PATH=$HOME/.local/lib
export PYTHONPATH=$HOME/.local/lib/python3/dist-packages
python3 -c "from edamer import detail, dt, fn; print('All systems go')"

For more examples on how to build and test this code see .gitlab-ci.yml.

Knowledge Base

Why does compilation take so much time and memory? The compiled library is several hundred megabytes in size!

Calling C++ functions from Python requires declaring and compiling bindings with pybind11 into the wrapper library for all C++ function signatures that should be callable from Python. For example, the function multiply must be bound for Elemental matrices with entry type int, float, double, .... If multiply is not wrapped for Elemental matrices with float entries, then it cannot be called with such matrices from Python. Currently, multiply is wrapped for int, float, double, complex<float> and complex<double>. For Elemental's non-distributed matrices, this results in five function overloads for multiply in Python:

1. multiply(a: ElMatrix_StdInt32T, b: ElMatrix_StdInt32T) -> ElMatrix_StdInt32T
2. multiply(a: ElMatrix_Float, b: ElMatrix_Float) -> ElMatrix_Float
3. multiply(a: ElMatrix_Double, b: ElMatrix_Double) -> ElMatrix_Double
4. multiply(a: ElMatrix_ElComplex_Float, b: ElMatrix_ElComplex_Float) -> ElMatrix_ElComplex_Float
5. multiply(a: ElMatrix_ElComplex_Double, b: ElMatrix_ElComplex_Double) -> ElMatrix_ElComplex_Double

For Elemental's distributed matrices it gets messy. For example, it should be possible to multiply a matrix with [STAR,STAR] distribution by a matrix with [MC,MR] distribution. Hence multiply must be provided for all possible combinations of matrix distributions, i.e. the Cartesian product of [int, float, double, ...], [[STAR,STAR],[MC,MR],[MR,MC], ...] and [[STAR,STAR],[MC,MR],[MR,MC], ...]. This results in another 845(!) function overloads for multiply!

For each of these Python function overloads, a C++ compiler generates a separate code path, because function templates in C++ get instantiated. Generic code in other languages, such as Java, is compiled differently; e.g. Java implements generics using type erasure and generates code just once for all generic functions. Both approaches have advantages and disadvantages, e.g. a C++ compiler can apply optimizations that a Java compiler cannot, but with many template instantiations, C++ libraries might get huge in size. For all entry types and all matrix distributions, the object file of multiply grows to 50 MiB in debug mode. The compilation and linking processes take up to 6 GiB of RAM.
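
As a hedged illustration of this overload dispatch, the sketch below assumes that dt.ElMatrix.view_from_numpy picks the wrapper class (ElMatrix_Float, ElMatrix_Double, ...) matching the NumPy dtype; only the overload list itself is taken from above.

# Hedged sketch: the same fn.multiply call resolves to different compiled
# overloads depending on the entry type (dtype-based wrapping is an assumption).
from edamer import dt, fn
import numpy as np

a32 = np.asarray(np.eye(4, dtype=np.float32), order='F')
a64 = np.asarray(np.eye(4, dtype=np.float64), order='F')

m32 = dt.ElMatrix.view_from_numpy(a32)  # presumably wrapped as ElMatrix_Float
m64 = dt.ElMatrix.view_from_numpy(a64)  # presumably wrapped as ElMatrix_Double

fn.multiply(m32, m32)  # overload 2 above (float entries)
fn.multiply(m64, m64)  # overload 3 above (double entries)
# fn.multiply(m32, m64)  # no such overload is listed above: entry types must match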

To reduce compilation time, memory usage and code size, irrelevant matrix entry types (aka scalar types) and matrix distributions can be disabled at compile time using the CMake options EDAMER_ENABLE_SCALAR_* and EDAMER_ENABLE_MATRIX_DISTRIBUTION_*. But beware that disabled matrix template instantiations cannot be used as function arguments or function return values! For example, if transpose is applied to a matrix with [MC,MR] distribution, then transpose will return a matrix with [MR,MC] distribution. The returned matrix can only be used if this matrix distribution has been compiled in. Accessing a return value whose type has not been compiled in results in a runtime error.
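
The sketch below illustrates this pitfall; the enum names dt.ElDist.MC and dt.ElDist.MR, and building an [MC,MR] view from a local matrix in this way, are assumptions extrapolated from the [STAR,STAR] example above.

# Hedged sketch of the return-type pitfall (dt.ElDist.MC / dt.ElDist.MR and the
# [MC,MR] view construction are assumptions, not confirmed API).
from edamer import dt, fn
from mpi4py import MPI
import numpy as np

grid = dt.ElGrid(MPI.COMM_WORLD)
local = dt.ElMatrix.view_from_numpy(np.asarray(np.eye(4), order='F'))
mc_mr = dt.MatrixDistribution.make(dt.ElDist.MC, dt.ElDist.MR, dt.ElDistWrap.ELEMENT)
a = dt.ElDistMatrix.make_view(grid, local, mc_mr)

# transpose of an [MC,MR] matrix returns an [MR,MC] matrix; using the result
# requires EDAMER_ENABLE_MATRIX_DISTRIBUTION_MR_MC=ON at compile time,
# otherwise it fails at runtime.
at = fn.transpose(a)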

Unit test dt_el_dist_matrix fails in function test_copy_redist due to zeros in the upper matrix indices!?

Try using a different MPI point-to-point management layer, e.g. ob1 instead of ucx. For example, set and export the variable OMPI_MCA_pml before executing the unit tests:

export OMPI_MCA_pml=ob1
ctest --verbose

The reason for this error is still unknown. So far, it has only occurred sporadically and only inside the container jm1337/debian-dev-hbrs:bullseye.

License

GNU Lesser General Public License v3.0 or later

See LICENSE.md for the full text.

Author

Jakob Meng @jm1 (GitHub.com, Web)