
WP3 WP4 Technical Foundations and Demonstrators


Contact Information for WP 3 / 4

You can find a description of existing FAIR Data Spaces demonstrators below. If you wish to contact the developers, please use the contact information provided in the following links:

WP 3 - Technical Foundations Sebastian Beyvers

WP 4.1 - NFDI4Biodiversity: Nikolaus Glombiewski

WP 4.2 - Data Quality Assurance: Jonathan Hartman

WP 4.3 - Cross-Platform FAIR Data Analysis: Yeliz Ucer Yediel


Welcome to the wiki of work packages 3 and 4!

WP 3 - Technical Foundations

The technical foundations for FAIR-DS centre on a decentralized, cloud-native infrastructure for deploying workloads and demonstrators as well as for data storage. Workloads are deployed on modern containerized infrastructure orchestrated via Kubernetes, using cloud resources provided by the German Network for Bioinformatics Infrastructure (de.NBI).

Technologies:

  • Workload deployments: Kubernetes (containers) and OpenStack (virtual machines)
  • Storage: Ceph and S3
  • CI/CD: FluxCD (K8s), GitLab Runners, GitHub Actions

Storage Infrastructure

While workloads are primarily orchestrated as containers on Kubernetes, data storage is handled by Aruna, a custom distributed storage engine developed as a collaborative effort by NFDI4Biodiversity and NFDI4Microbiota for the Research Data Commons (RDC).

Aruna

Aruna is the core component of the data storage architecture. It is written in Rust and consists of two main components. The first is the data proxy, which connects to various existing storage solutions (S3, file system, etc.) and makes them accessible through a common S3-compatible interface, allowing easy integration into existing infrastructures and workflows. Data proxies are sovereign gatekeepers over their data and integrate a dedicated policy and authorization system.
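
Because every data proxy exposes the same S3-compatible interface, standard S3 tooling can talk to it directly. The following is a minimal sketch using Python's boto3; the endpoint URL, bucket name, object key, and credentials are placeholders, not values from an actual Aruna deployment.

```python
# Minimal sketch: accessing an Aruna data proxy through its S3-compatible
# interface with boto3. Endpoint, bucket, key, and credentials are
# hypothetical placeholders; substitute the values issued by your deployment.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://proxy.example-aruna-instance.org",  # data proxy endpoint (placeholder)
    aws_access_key_id="<ARUNA_ACCESS_KEY>",
    aws_secret_access_key="<ARUNA_SECRET_KEY>",
)

# List objects in a project bucket and download one record.
response = s3.list_objects_v2(Bucket="my-project-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

s3.download_file("my-project-bucket", "records/sample.fastq", "sample.fastq")
```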

The second component is a distributed server that maintains a global data catalog, indexing metadata for all registered records across all proxy locations. This enables the server components to orchestrate data exchange and synchronization with internal and external data consumers and processing workflows. The server also includes its own authentication and authorization system, which eases the burden of integrating a full-featured IAM into each proxy and simplifies proxy configuration. The server components additionally act as gateways to other data spaces via EDC- and IDSA-compatible interfaces.

WP 4 - FAIR Data Space Demonstrators

WP 4.1 - FAIR-DS Demonstrator: NFDI4Biodiversity

This demonstrator is part of the RDC architecture. An important goal in the development of the RDC is to provide a cloud-based technical infrastructure that allows users to collaborate in data exploration tasks. The demonstrator handles vector and raster data originating from the NFDI4Biodiversity use cases; example data sets from this domain include animal observations and satellite images.

The main software component of this demonstrator is Geo Engine, a cloud-based research environment for spatio-temporal data processing. Geo Engine supports interactive data analyses for geodata, such as vector and raster data. In particular, Geo Engine offers a variety of data connectors and functionality that allows data scientists to focus on the actual data analyses rather than data preparation.

In the FAIR-DS context, Geo Engine adds new data connectors and cloud features to demonstrate important aspects of FAIR-DS. Among other things, this includes novel use cases combining data from a variety of sources as well as authentication according to Gaia-X specifications. Below, we list the three main Geo Engine repositories used in this project.

Geo Engine

The core of the Geo Engine is written in Rust. It includes a variety of data sources and data processing operators. For FAIR-DS, a data connector to the Aruna Object Storage of the RDC was developed.

Repository Link: https://github.com/geo-engine/geoengine

Geo Engine - User Interface

The main point for interaction with the demonstrator is a web user interface. It can display a variety of geodata and allows users to create analyses through exploratory workflows on an interactive map.

Repository Link: https://github.com/geo-engine/geoengine-ui

Geo Engine - Python Library

As an additional way of interacting with the Geo Engine, the demonstrator can be accessed through a Python library.

Repository Link: https://github.com/geo-engine/geoengine-python
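
As a rough sketch of this interaction path, the snippet below connects to a Geo Engine instance and queries a vector dataset. The instance URL and dataset name are invented for illustration, and the call names (initialize, register_workflow, get_dataframe, QueryRectangle) follow the library's documented usage pattern but should be verified against the current geoengine-python API.

```python
# Illustrative sketch of accessing Geo Engine via the Python library.
# URL and dataset name are placeholders; check the geoengine-python
# repository for the current API before relying on these call names.
from datetime import datetime

import geoengine as ge

# Connect to a running Geo Engine instance (placeholder URL).
ge.initialize("https://example-geoengine-instance.org/api")

# Register a workflow reading a vector dataset, e.g. animal observations.
workflow = ge.register_workflow({
    "type": "Vector",
    "operator": {
        "type": "OgrSource",
        "params": {"data": "animal_observations"},  # hypothetical dataset name
    },
})

# Query a spatio-temporal window and load the result as a GeoDataFrame.
df = workflow.get_dataframe(
    ge.QueryRectangle(
        ge.BoundingBox2D(-180.0, -90.0, 180.0, 90.0),
        ge.TimeInterval(datetime(2020, 1, 1), datetime(2021, 1, 1)),
        resolution=ge.SpatialResolution(0.1, 0.1),
    )
)
print(df.head())
```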

Contact

For further information on the WP 4.1 demonstrator, you can contact the lead developer.

WP 4.2 - FAIR-DS Demonstrator: Data Quality Assurance

The purpose of this demonstrator is to show how decentralized task runners can perform automated quality control and data assurance within a commonly available or easily provisioned environment. The demonstrator leverages GitLab CI/CD workflow management to send sections of a user-provided dataset to a cloud-based Kubernetes instance, where quality assurance is performed by way of the Frictionless Toolkit and automated statistical analysis. User authentication and secret management are handled by the GitLab instance.

The demonstrator consists primarily of a Python library containing all the code necessary to run the analysis. Also provided are a publicly available GitLab repository, which contains a CI/CD script that automatically calls the library and can be customized to the user's use case, and a Docker container that comes pre-installed with the library and all dependencies needed to execute it.

The GitLab framework is heavily leveraged: its unit-test reporting tools inform the user of the status of the provided data, and a report is published via the GitLab Pages mechanism.
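
To illustrate the quality-assurance step, the sketch below shows the kind of check performed with the Frictionless Toolkit. The dataset path is a placeholder; the demonstrator's own library wraps checks like this and feeds the results into GitLab's reporting.

```python
# Minimal sketch of the kind of check the demonstrator performs: validating
# a user-provided tabular dataset with the Frictionless framework. The file
# path is a hypothetical placeholder.
from frictionless import validate

report = validate("data/measurements.csv")  # placeholder dataset path

if report.valid:
    print("Dataset passed all structural and content checks.")
else:
    # Field names accepted by flatten() depend on the Frictionless version;
    # these are the Frictionless v5 names.
    for row_number, field_name, error_type in report.flatten(
        ["rowNumber", "fieldName", "type"]
    ):
        print(f"row {row_number}, field {field_name}: {error_type}")
```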

Python Library

The core of the demonstrator is the Python library. This package comes pre-installed in the Docker container listed below.

Repository

Docker Container

The Docker container comes with the demonstrator and all of its dependent libraries pre-installed.

Link

Forkable Repository

This repository contains a pre-configured version of the CI/CD script which will run the most recent version of the library. Users may fork this repository and assign their own runners to execute the code. Information on installing personal runners can be found here.

Repository

Example Projects

A collection of example projects using publicly available datasets:

Contact

For further information on the WP 4.2 demonstrator, or for any related queries, please feel free to reach out to Jonathan Hartman.

WP 4.3 - FAIR-DS Demonstrator: Cross-Platform FAIR Data Analysis

Contact: yeliz.ucer.yediel@fit.fraunhofer.de

This demonstrator aims to set up and apply an infrastructure for cross-platform, privacy-compliant analysis of distributed medical data without sharing the data itself. Since the information is person-related and thus sensitive, we implement the Personal Health Train (PHT), a distributed analytics infrastructure service for analysing health-domain data.

The overall goal of this demonstrator is, first, to reuse the current achievements of both the NFDI (in particular, NFDI4Health) and the MII with respect to medical data structures, formats, and ethical and legal requirements, and, second, to combine these results with the Gaia-X FAIR Data Space.

The PHT workflow is analogous to a railway network. An analysis workflow is called a train: code that travels sequentially from one train station (data source) to the next until the final station is reached. At each station, a station-specific process analyses the locally available data. The train is a container encapsulating the algorithms, i.e., the analysis script or program, together with previously generated intermediate results, such as a classification model trained at earlier stations on their available data. A central station collects the analysis outputs from each train station. The approach is independent of the programming language used for the analysis code.
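
To make the train metaphor concrete, here is a purely illustrative Python sketch (not taken from the PADME codebase) of what a train's entrypoint might do at each station: load the intermediate result left by previous stations, update it with station-local data, and write it back for the next station. All paths and the aggregation logic are hypothetical.

```python
# Illustrative sketch of a PHT train entrypoint (not from the PADME codebase).
# At each station the container loads the intermediate result produced by
# previous stations, updates it with station-local data, and stores it for
# the next station. Paths and aggregation logic are hypothetical.
import json
import os

MODEL_PATH = "/train/model.json"     # intermediate result travelling with the train
STATION_DATA = "/station/data.json"  # station-local data, never leaves the station

def load_or_init_model():
    if os.path.exists(MODEL_PATH):
        with open(MODEL_PATH) as f:
            return json.load(f)
    return {"count": 0, "mean": 0.0}  # fresh model at the first station

def update_with_local_data(model):
    with open(STATION_DATA) as f:
        values = json.load(f)         # e.g. a list of local measurements
    for x in values:
        model["count"] += 1
        model["mean"] += (x - model["mean"]) / model["count"]  # incremental mean
    return model

if __name__ == "__main__":
    model = update_with_local_data(load_or_init_model())
    with open(MODEL_PATH, "w") as f:
        json.dump(model, f)           # only the aggregate leaves the station
```

Only the aggregated intermediate result travels onward with the train; the raw, person-related data never leaves the station.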

Cross-Platform FAIR Data Analysis Demonstrator Documentation

Documentation Link

PHT Implementation

For further details about the PHT implementation (PADME), please access the link below:

Website: PADME

Technical Documentation

For further details about the PHT Technical Documentation, please access the link below:

Website: https://docs.padme-analytics.de/

These easy-to-use how-to documents will guide you through our demonstrator environment.

If you would like to contribute in a data provider role and take part in Federated Learning use cases, please see:

How to: Initial Station Setup https://docs.padme-analytics.de/en/how-to/initial-station-setup

How to Use: Station Registry https://docs.padme-analytics.de/en/how-to/station-registry

How to Use: Station Software https://docs.padme-analytics.de/en/how-to/StationSoftware

If you would like to take on an analytics/federated algorithm development role and execute your code in the federated architecture, please see:

Converting Centralized Learning to Federated Learning and to a PHT Train https://docs.padme-analytics.de/en/how-to/centralized-to-federated-to-pht

How to: Sign Train Images https://docs.padme-analytics.de/en/how-to/sign-train-images

How to Use: Train Creator https://docs.padme-analytics.de/en/how-to/train-creator

How to Use: Train Requester https://docs.padme-analytics.de/en/how-to/train-requester

Development Environment Setup https://docs.padme-analytics.de/en/internal/getting-started/dev-env-setup

Metadata

PHT Metadata Schema Specification 1.0 (padme-analytics.de)

PHT Presentation:

Slides

Main repository:

PADME Dev - GitLab

Deployed components on de.NBI:

Central components:

Station Registry

Train Registry

Depot - Train Code Repository

Train Requester

Train Creator

Storehouse

Station components:

Station Software