WP3 WP4 Technical Foundations and Demonstrators
You can find a description of existing FAIR Data Spaces demonstrators below. If you wish to contact the developers, please use the contact information provided in the following links:
WP 3 - Technical Foundations: Sebastian Beyvers
WP 4.1 - NFDI4Biodiversity: Nikolaus Glombiewski
WP 4.2 - Data Quality Assurance: Jonathan Hartman
WP 4.3 - Cross-Platform FAIR Data Analysis: Yeliz Ucer Yediel
The technical foundations for FAIR-DS focus on a decentralized, cloud-native infrastructure for deploying workloads and demonstrators as well as for data storage. Workloads are deployed on modern containerized infrastructure orchestrated via Kubernetes, using cloud resources provided by the German Network for Bioinformatics Infrastructure (de.NBI).
- Workload deployments: Kubernetes (containers) and OpenStack (virtual machines)
- Storage: Ceph and S3
- CI/CD: FluxCD (K8s), GitLab Runners, GitHub Actions
While workloads are primarily orchestrated as containers on Kubernetes, data storage is handled by Aruna, a custom distributed storage engine developed collaboratively for the Research Data Commons (RDC) by NFDI4Biodiversity and NFDI4Microbiota.
- Web / Documentation: https://aruna-storage.org
- Repository Link: https://github.com/ArunaStorage/aruna
Aruna is the core component of the data storage architecture. It is written in Rust and consists of two main components. The first is a so-called data proxy that connects to various existing storage solutions (S3, file system, etc.) and makes them accessible through a common S3-compatible interface, allowing easy integration into existing infrastructures and workflows. Data proxies are sovereign gatekeepers over their data and integrate a dedicated policy and authorization system.
The second component is a distributed server component that integrates a global data catalog, with metadata indexing all registered records across all proxy locations. This enables the server components to orchestrate data exchange and synchronization with internal and external data consumers and processing workflows. The server also includes its own authentication and authorization system, which can ease the burden of integrating a full-featured IAM into proxies, simplifying their integration and configuration. The server components are also gateways to other data spaces via EDC- and IDSA-compatible interfaces.
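To illustrate the gatekeeping idea, the following is a minimal sketch of a proxy evaluating its own policies before serving an object. The `Policy` and `DataProxy` names and the prefix-based rule model are purely hypothetical and do not reflect Aruna's actual policy engine or API.

```python
# Hypothetical sketch: a data proxy that is the sole authority over its
# backing storage and checks its own policies before serving data.
# Names and the policy model are illustrative, not Aruna's actual API.
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Allow a set of actions on keys under a given prefix."""
    prefix: str
    actions: frozenset

@dataclass
class DataProxy:
    """A sovereign gatekeeper over one storage location."""
    name: str
    policies: list = field(default_factory=list)
    objects: dict = field(default_factory=dict)

    def authorize(self, key: str, action: str) -> bool:
        # A request is allowed only if some local policy covers it.
        return any(key.startswith(p.prefix) and action in p.actions
                   for p in self.policies)

    def get(self, key: str) -> bytes:
        if not self.authorize(key, "read"):
            raise PermissionError(f"{self.name}: policy denies read on {key}")
        return self.objects[key]

proxy = DataProxy("example-proxy",
                  policies=[Policy("public/", frozenset({"read"}))],
                  objects={"public/obs.csv": b"species,count\n",
                           "private/raw.bin": b"\x00"})
print(proxy.get("public/obs.csv"))  # allowed: covered by the public/ policy
```

A read on `private/raw.bin` would raise `PermissionError`, since no policy of this proxy covers it; in the real architecture such decisions remain local to each proxy.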
This demonstrator is part of the RDC architecture. An important goal in the development of the RDC is to provide a cloud-based technical infrastructure, which allows users to collaborate in data exploration tasks. Data handled in the demonstrator is vector and raster data, originating from the NFDI4Biodiversity use cases. Example data sets from the NFDI4Biodiversity data domain include animal observations as well as satellite images.
The main software component of this demonstrator is Geo Engine, a cloud-based research environment for spatio-temporal data processing. Geo Engine supports interactive data analyses for geodata, such as vector and raster data. In particular, Geo Engine offers a variety of data connectors and functionality that allows data scientists to focus on the actual data analyses rather than data preparation.
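As a toy illustration of combining the two data types mentioned above (not Geo Engine's API), the following samples a small raster grid at the coordinates of vector point observations; the grid values, species names, and coordinates are made up.

```python
# Illustrative only: enrich vector observations with the raster value
# of the cell they fall into. Not Geo Engine's actual API or data model.

# A 3x3 raster covering lon 0..3, lat 0..3, one value per 1-degree cell.
raster = [
    [0.1, 0.4, 0.7],  # lat 0..1
    [0.2, 0.5, 0.8],  # lat 1..2
    [0.3, 0.6, 0.9],  # lat 2..3
]

observations = [  # (species, lon, lat) -- toy vector data
    ("Ciconia ciconia", 0.5, 0.5),
    ("Lynx lynx", 2.5, 1.5),
]

def sample(raster, lon, lat, cell=1.0):
    """Return the raster value of the cell containing (lon, lat)."""
    col, row = int(lon // cell), int(lat // cell)
    return raster[row][col]

enriched = [(s, sample(raster, lon, lat)) for s, lon, lat in observations]
print(enriched)  # each observation paired with its raster value
```

In Geo Engine itself, this kind of join between vector and raster sources is expressed through workflows and operators rather than hand-written loops.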
In the FAIR-DS context, Geo Engine adds new data connectors and cloud features to demonstrate important aspects of FAIR-DS. Among other things, this includes novel use cases combining data from a variety of sources as well as authentication according to GAIA-X specifications. Below, we list the three main repositories of Geo Engine that are used in this project.
The core of the Geo Engine is written in Rust. It includes a variety of data sources and data processing operators. For FAIR-DS, a data connector to the Aruna Object Storage of the RDC was developed.
Repository Link: https://github.com/geo-engine/geoengine
The main point for interaction with the demonstrator is a web user interface. It can display a variety of geodata and allows users to create analyses through exploratory workflows on an interactive map.
Repository Link: https://github.com/geo-engine/geoengine-ui
As an additional way of interacting with the Geo Engine, the demonstrator can be accessed through a Python library.
Repository Link: https://github.com/geo-engine/geoengine-python
For further information on the WP 4.1 demonstrator, you can contact the lead developer.
The purpose of this demonstrator is to exhibit the use of decentralized task runners to perform automated quality control and data assurance within a commonly available or easily provisioned environment. The demonstrator leverages GitLab CI/CD workflow management to send sections of a user-provided dataset to a cloud-based Kubernetes instance for quality assurance, by way of the Frictionless Toolkit and automated statistical analysis. User authentication and secret management are handled by the GitLab instance.
The demonstrator consists primarily of a Python library, which contains all the code necessary to run the analysis. Also provided are a publicly available GitLab repository, which contains a CI/CD script that automatically calls the library and can be customized to the user's use case, and a Docker container that comes pre-installed with the library and all dependencies needed to execute it.
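The section-wise checking described above can be sketched as follows. This is a simplified stand-in, not the demonstrator's actual library or the Frictionless Toolkit: the `check_section` function, the two-field schema, and the sample rows are all invented for illustration.

```python
# Simplified sketch of section-wise quality checks: split a dataset into
# sections and run simple rules per section, collecting error messages
# that a CI job could publish as a report. Not the actual library.

def check_section(rows, schema):
    """Return error strings for one section (row indices are section-local)."""
    errors = []
    for i, row in enumerate(rows):
        for field, caster in schema.items():
            value = row.get(field)
            if value in (None, ""):
                errors.append(f"row {i}: missing {field}")
                continue
            try:
                caster(value)  # type check: can the value be parsed?
            except ValueError:
                errors.append(f"row {i}: bad {field}={value!r}")
    return errors

data = [
    {"length": "0.455", "rings": "15"},
    {"length": "", "rings": "7"},        # missing value
    {"length": "0.35", "rings": "seven"} # wrong type
]
schema = {"length": float, "rings": int}

# Process in sections of 2 rows, as the pipeline might fan work out.
report = []
for start in range(0, len(data), 2):
    report.extend(check_section(data[start:start + 2], schema))
print(report)
```

In the real demonstrator, such per-row findings are surfaced through GitLab's unit-test reporting rather than a printed list.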
The GitLab framework is heavily leveraged: the unit-test reporting tools inform the user of the status of the provided data, and a report is published through the GitLab Pages mechanism.
The core of the demonstrator is the Python library. This package comes pre-installed in the Docker container listed below.
The Docker container comes pre-installed with the demonstrator and all of its dependent libraries.
This repository contains a pre-configured version of the CI/CD script which will run the most recent version of the library. Users may fork this repository and assign their own runners to execute the code. Information on installing personal runners can be found here.
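The overall shape of such a CI/CD script might look as follows. This is an illustrative sketch only: the job names, image name, entry point, and paths are placeholders, not the repository's actual configuration.

```yaml
# Illustrative .gitlab-ci.yml sketch -- names and paths are placeholders.
stages:
  - quality

quality_check:
  stage: quality
  image: registry.example.org/fair-ds/quality-demo:latest  # hypothetical image
  script:
    # hypothetical entry point into the quality-assurance library
    - python -m quality_demo --input data/dataset.csv --report public/index.html
  artifacts:
    reports:
      junit: report.xml   # surfaces findings via GitLab's unit-test reporting
    paths:
      - public            # published via the GitLab Pages mechanism
```

The two artifact outputs correspond to the two reporting channels mentioned above: unit-test results shown in the merge-request UI, and a browsable report on GitLab Pages.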
A collection of example projects using publicly available datasets:
- Abalone Repository | Report
- Titanic Repository | Report
For further information on the WP 4.2 demonstrator, or for any related queries, please feel free to reach out to Jonathan Hartman.
Contact: yeliz.ucer.yediel@fit.fraunhofer.de
This demonstrator study aims to set up and apply an infrastructure to demonstrate cross-platform privacy-compliant analysis of distributed medical data without sharing it. Since the information is person-related and, thus, sensitive, we implement Personal Health Train (PHT), a distributed analytics infrastructure service for analysing health-domain data.
The overall goal of this demonstrator is, first, to reuse the current achievements of both the NFDI (in particular, NFDI4Health) and the MII with respect to medical data structures, formats, and ethical and legal requirements, and, second, to combine these results with the Gaia-X FAIR Data Space.
The PHT workflow is analogous to a railway network. An analysis workflow is called a train (code) that runs sequentially from one train station (data source) to the next until the final one is reached. At each station, a station-specific process analyses the locally available data. The train is a container encapsulating the algorithms, i.e., the analysis script/program, together with previously generated intermediate results, such as a classification model trained by earlier stations on their available data. A central station collects the analysis outputs from each train station. The approach is independent of the programming language used.
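The train metaphor can be sketched in a few lines. This toy example (not the PADME implementation) computes a global mean age across stations: the "train" carries only aggregate counts and totals from station to station, so raw records never leave a station. The station names and data are invented.

```python
# Toy sketch of the PHT train metaphor, not the PADME implementation:
# the train carries only intermediate aggregates between stations.

stations = {  # each station's private data, e.g. patient ages (invented)
    "Station A": [34, 41, 29],
    "Station B": [52, 47],
    "Station C": [38, 44, 50, 31],
}

def run_train(stations):
    state = {"count": 0, "total": 0}  # intermediate result the train carries
    for name, private_data in stations.items():
        # station-specific step: update aggregates locally;
        # only `state`, never `private_data`, travels onward
        state["count"] += len(private_data)
        state["total"] += sum(private_data)
    # the central station derives the final result from the aggregates
    return state["total"] / state["count"]

print(run_train(stations))  # global mean without pooling raw records
```

A real train would encapsulate such logic in a container image, and the carried state could be richer, e.g. partially trained model parameters.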
For further details about the PHT implementation (PADME), please access the link below:
Website: PADME
For further details about the PHT Technical Documentation, please access the link below:
Website: https://docs.padme-analytics.de/
These easy-to-use how-to documents will guide you through our demonstrator environment.
If you would like to participate in a Data Provider role and take part in Federated Learning use cases, please see:
How to: Initial Station Setup https://docs.padme-analytics.de/en/how-to/initial-station-setup
How to Use: Station Registry https://docs.padme-analytics.de/en/how-to/station-registry
How to Use: Station Software https://docs.padme-analytics.de/en/how-to/StationSoftware
If you would like to participate in an Analytics/Federated Algorithm Development role and execute your code in the federated architecture, please see:
Converting Centralized Learning to Federated Learning and to a PHT Train https://docs.padme-analytics.de/en/how-to/centralized-to-federated-to-pht
How to: Sign Train Images https://docs.padme-analytics.de/en/how-to/sign-train-images
How to Use: Train Creator https://docs.padme-analytics.de/en/how-to/train-creator
How to Use: Train Requester https://docs.padme-analytics.de/en/how-to/train-requester
Development Environment Setup https://docs.padme-analytics.de/en/internal/getting-started/dev-env-setup