Accelerating massive data processing in Python with Heat - a tutorial #2

Open
mrfh92 opened this issue Jan 22, 2024 · 2 comments

Comments

@mrfh92

mrfh92 commented Jan 22, 2024

Please note, this tutorial has been merged with #10 HPC for Researchers, i.e., both will be handled in one full-day tutorial.

Title

Accelerating massive data processing in Python with Heat - a tutorial

Responsible persons

  • Fabian Hoppe @mrfh92, Deutsches Zentrum für Luft- und Raumfahrt e.V., Institut für Softwaretechnologie, High-Performance Computing, Köln
  • Kai Krajsek @krajsek, Forschungszentrum Jülich GmbH, Institute for Advanced Simulation, Jülich Supercomputing Centre
  • Claudia Comito @ClaudiaComito, Forschungszentrum Jülich GmbH, Institute for Advanced Simulation, Jülich Supercomputing Centre

Format

tutorial with

  • a brief introduction talk (15-20 minutes)
  • and a hands-on part (rest of the time)

Timeframe

2-3 hours (+break(s))

Description

Many data processing workflows in science and technology build on Python libraries such as NumPy, SciPy, and scikit-learn, which are easy to learn and easy to use. These libraries rest on highly optimized computational backends and therefore deliver competitive performance, at least as long as GPU acceleration is not required and the memory of a single workstation or cluster node suffices for all required tasks.
However, with steadily growing data sets, being limited to the RAM of a single machine can become a severe obstacle. At the same time, moving from a workstation to a (GPU) cluster can be challenging for domain experts without prior HPC experience.

This group of users is the target audience of our Python library Heat ("Helmholtz Analytics Toolkit"), to which we give a brief hands-on introduction in this tutorial. Heat builds on PyTorch and mpi4py and simplifies porting NumPy/SciPy-based code to GPUs (CUDA, ROCm), including multi-GPU, multi-node clusters. On the surface, Heat implements a NumPy-like API, is largely interoperable with the Python array ecosystem, and can be used as a backend to accelerate existing single-CPU pipelines as well as to develop new HPC applications from scratch. Under the hood, Heat distributes memory-intensive operations and algorithms via MPI communication and thus avoids some of the overhead often introduced by other libraries that scale NumPy/SciPy/scikit-learn applications through task parallelism.
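To make the data-parallel idea concrete, here is a minimal sketch in plain Python that does not use Heat or MPI at all. It only simulates the pattern: each process works on its own chunk of the data, and a reduction step combines the partial results. The function names (`local_stats`, `combine`, `distributed_mean`) are hypothetical and chosen for illustration; in a real run, the list of chunks would live on separate MPI processes and the combination would be a collective MPI operation.

```python
from functools import reduce


def local_stats(chunk):
    """Work done independently on one process: a partial sum and a count."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Stand-in for one step of an MPI reduction: merge two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(chunks):
    """Mean of a data set that is stored as one chunk per (simulated) process."""
    # In a real MPI program these local computations would run in parallel.
    partials = [local_stats(chunk) for chunk in chunks]
    # Only the small partial results are exchanged, never the chunks themselves.
    total, count = reduce(combine, partials)
    return total / count


# Ten values spread over four simulated processes.
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
mean = distributed_mean(chunks)
print(mean)
```

The point of the sketch is that the full data set never has to fit into the memory of a single process; only the per-chunk summaries are communicated.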

In this tutorial you will get an overview of:

  • Heat's basics: getting started with distributed I/O, data decomposition scheme, array operations
  • Existing functionalities: multi-node linear algebra, statistics, signal processing, machine learning…
  • DIY how-to: using existing Heat infrastructure to build your own multi-node, multi-GPU research software.
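As a rough illustration of what a data decomposition scheme along one axis can look like, the following plain-Python sketch computes which contiguous block of rows each process would own. It assumes a simple block distribution in which any remainder goes to the lowest-numbered processes; this convention and the helper name `chunk_bounds` are assumptions made for illustration, and Heat's actual chunking may differ in detail.

```python
def chunk_bounds(n, nprocs, rank):
    """Return (offset, length) of the block owned by `rank`.

    Assumed convention: n items are split into nprocs contiguous blocks,
    and the first (n % nprocs) ranks each receive one extra item.
    """
    base, extra = divmod(n, nprocs)
    length = base + (1 if rank < extra else 0)
    offset = rank * base + min(rank, extra)
    return offset, length


# Layout of 10 rows over 4 processes.
layout = [chunk_bounds(10, 4, rank) for rank in range(4)]
for rank, (offset, length) in enumerate(layout):
    print(f"rank {rank}: rows {offset} to {offset + length - 1}")
```

Each process then only needs to read and store its own block, which is the basis for distributed I/O and for operations on arrays larger than the memory of a single node.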

We will also touch upon Heat's implementation roadmap and possible paths to collaboration.

Requirements

  • Teaching room with a projector
  • Attendees need to bring their own hardware (laptop), ideally with the option to connect to an HPC-system via SSH
  • To enable at least a minimum of individual interaction: max. ~ 15-20 attendees

References

M. Götz et al., HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics, 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 2020, pp. 276-287, doi: 10.1109/BigData50022.2020.9378050.

@mrfh92 changed the title from "Accelerating massive data processing in Python with [Heat](https://github.com/helmholtz-analytics/heat/) - a tutorial" to "Accelerating massive data processing in Python with Heat - a tutorial" on Jan 22, 2024
@SusanneWenzel
Collaborator

Thank you for this submission @mrfh92. Sounds very good! I'm happy to help set it up. Let's talk about logistics and further details later.

@mrfh92
Author

mrfh92 commented Jan 25, 2024

Thanks 👍
Do not hesitate to contact us if you need further information etc.
