
Ingest


A data platform for everyone

About

The ingest is a service of the Superb Data Kraken Platform (SDK). It is designed for managing data-ingestion into the SDK.

For a more detailed understanding of the broader context of the platform this project is used in, refer to the architecture documentation.

For instructions on how to deploy the ingest on an instance of the SDK, refer to the installation instructions.

The workers that are part of the ingest are explained in more detail below.

Ingest

skip_validation

Certain organizations may not need their data validated (it should be ingested as is). This worker makes that configurable: the environment variable SKIP_VALIDATE_ORGANIZATIONS lists the organizations for which no validation is required.
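
As an illustration, the check could boil down to a Python sketch like the one below; the helper name is hypothetical, only the variable SKIP_VALIDATE_ORGANIZATIONS comes from this README.

import os

def should_skip_validation(organization: str) -> bool:
    """Return True if the given organization is configured to skip validation."""
    # SKIP_VALIDATE_ORGANIZATIONS is a comma-separated list of organization names
    configured = os.environ.get("SKIP_VALIDATE_ORGANIZATIONS", "")
    return organization in {name.strip() for name in configured.split(",") if name.strip()}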

basic_metadata

If no "qualified" metadata is provided, this worker generates a basic metadata-set and stores it in cloud storage (the loadingzone).
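
Purely for illustration, a generated fallback set could look like the following sketch; the actual schema of the metadata-set is defined by the worker, and all field and dataset names here are assumptions.

import json
from datetime import datetime, timezone

def build_basic_metadata(organization: str, space: str, dataset_name: str) -> dict:
    # minimal identifying fields plus an ingest timestamp; the real schema
    # produced by the basic_metadata worker may differ
    return {
        "organization": organization,
        "space": space,
        "name": dataset_name,
        "dateTime": datetime.now(timezone.utc).isoformat(),
    }

# the worker stores the result as meta.json in the loadingzone
with open("meta.json", "w") as fh:
    json.dump(build_basic_metadata("myorga", "myspace", "measurement-001"), fh, indent=2)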

anonymize

This worker will provide functionality to anonymize metadata; however, it is not implemented yet.

enrichment

This worker will provide functionality to enrich metadata; however, it is not implemented yet.

validate

This worker will provide functionality to validate the dataset; however, it is not implemented yet.

metadata_index

Indexes meta.json to the dedicated <orga>_<space>_measurements-index via the metadata-service. For this, the worker reads meta.json from cloud storage and passes its content to the service. The resulting document-id is stored in a dedicated file ingest.json, which prevents the document from being indexed multiple times. CAUTION: Only users with the role <orga>_<space>_trustee may update documents; executing the ingest multiple times as a user without trustee permission will lead to errors! A sketch of this flow follows the variable table below.

The following environment variables are required:

| name | description |
| --- | --- |
| CLIENT_ID | client-id of the confidential OAuth client |
| CLIENT_SECRET | client-secret of the confidential OAuth client |
| ACCESS_TOKEN_URI | URI of the token endpoint |
| INDEXER_URL | URL of the metadata-backend |
| STORAGE_TYPE | storage type, one of azure or s3 (default: azure; s3 is currently not supported) |
| ACCESSMANAGER_URL | URL of the accessmanager (only required for azure storage) |
| STORAGE_DOMAIN | domain of the storage implementation (only required for s3 storage, currently not supported) |
| BUCKET | storage bucket (only required for s3 storage, currently not supported) |
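
To make the flow concrete, here is a minimal Python sketch of the indexing step, assuming a standard OAuth client-credentials token flow; the exact request path and response payload of the metadata-service are assumptions, and a local file stands in for the cloud storage.

import json
import os

import requests

# obtain a service token via the OAuth client-credentials flow
token_response = requests.post(
    os.environ["ACCESS_TOKEN_URI"],
    data={
        "grant_type": "client_credentials",
        "client_id": os.environ["CLIENT_ID"],
        "client_secret": os.environ["CLIENT_SECRET"],
    },
)
token_response.raise_for_status()
token = token_response.json()["access_token"]

# read meta.json (a local file stands in for the loadingzone here)
with open("meta.json") as fh:
    metadata = json.load(fh)

# pass the content to the metadata-service for indexing
index_response = requests.post(
    os.environ["INDEXER_URL"],  # exact endpoint path is an assumption
    json=metadata,
    headers={"Authorization": f"Bearer {token}"},
)
index_response.raise_for_status()

# persist the document-id in ingest.json so a re-run updates instead of re-indexing
document_id = index_response.json().get("id")  # response field name is an assumption
with open("ingest.json", "w") as fh:
    json.dump({"documentId": document_id}, fh)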

move_data

Finally, the data is moved from the loadingzone to the main storage.

The following pipeline variables are required:

| name | description |
| --- | --- |
| CLIENT_ID | client-id of the confidential OAuth client |
| CLIENT_SECRET | client-secret of the confidential OAuth client |
| ACCESS_TOKEN_URI | URI of the token endpoint |
| STORAGE_TYPE | storage type, one of azure or s3 (default: azure; s3 is currently not supported) |
| READ_ENDPOINT | endpoint for generating a SAS token in read scope (only required for azure storage) |
| UPLOAD_ENDPOINT | endpoint for generating a SAS token in upload scope (only required for azure storage) |
| DELETE_ENDPOINT | endpoint for generating a SAS token in delete scope (only required for azure storage) |
| BLACKLIST | comma-separated list of wildcarded blob names that are not moved to main storage but deleted directly |
| <ORGA>.WHITELIST | comma-separated list of wildcarded blob names that are moved to main storage |
| STORAGE_DOMAIN | domain of the storage implementation (only required for s3 storage, currently not supported) |
| BUCKET | storage bucket (only required for s3 storage, currently not supported) |

NOTE on black- and whitelist: The blacklist applies globally. Use it to block files that could potentially damage the system (e.g. *.exe, *.bat). If your organization only uses certain file extensions, you can use the organization-scoped whitelist to prevent other extensions from being uploaded. The blacklist restricts each whitelist.

With the following configuration, a bat-file (blacklisted, even though it is whitelisted) and a png-file (not on the whitelist) would not be moved to main storage:

blacklist = "*.exe,*.bat"
whitelist = "*.csv,*.json,*.bat"
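
The decision rule can be sketched in Python with fnmatch for the wildcard matching; the helper name and the behaviour for an empty whitelist are assumptions, the patterns come from the example above.

from fnmatch import fnmatch

def should_move(blob_name: str, blacklist: list[str], whitelist: list[str]) -> bool:
    # the blacklist applies globally: matching blobs are deleted, never moved
    if any(fnmatch(blob_name, pattern) for pattern in blacklist):
        return False
    # with a whitelist configured, only matching blobs are moved
    if whitelist:
        return any(fnmatch(blob_name, pattern) for pattern in whitelist)
    return True

blacklist = ["*.exe", "*.bat"]
whitelist = ["*.csv", "*.json", "*.bat"]
print(should_move("data.csv", blacklist, whitelist))   # True:  whitelisted
print(should_move("run.bat", blacklist, whitelist))    # False: blacklist overrides the whitelist
print(should_move("plot.png", blacklist, whitelist))   # False: not on the whitelist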

Getting Started

Follow the instructions below to set up a local copy of the project for development and testing.

Prerequisites

Setup

You may provide a secret auth-secret (as referenced by the ingest-sensor) with the following manifest; note that the values under data must be base64-encoded:

apiVersion: v1
data:
  ACCESS_TOKEN_URI: <ACCESS_TOKEN_URI_BASE64>
  CLIENT_ID: <CLIENT_ID_BASE64>
  CLIENT_SECRET: <CLIENT_SECRET_BASE64>
kind: Secret
metadata:
  name: auth-secret
  namespace: argo-mgmt
type: Opaque
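
Alternatively, the same secret can be created directly with kubectl, which base64-encodes the values for you (placeholder values shown):

kubectl create secret generic auth-secret --namespace argo-mgmt \
  --from-literal=ACCESS_TOKEN_URI=<ACCESS_TOKEN_URI> \
  --from-literal=CLIENT_ID=<CLIENT_ID> \
  --from-literal=CLIENT_SECRET=<CLIENT_SECRET>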

This secret is referenced by the metadata_index and move_data workers.

Configuration

The configuration of the ingest takes place in argo/config-map.yml.

As already mentioned under skip_validation, the property SKIP_VALIDATE_ORGANIZATIONS is a comma-separated list of organizations whose data should not be validated.
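
For illustration, the corresponding entry in argo/config-map.yml could look like this (the ConfigMap name and the organization names are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingest-config        # name is illustrative
  namespace: argo-mgmt
data:
  SKIP_VALIDATE_ORGANIZATIONS: "orga1,orga2"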

Every other setting in this file refers to your cluster-internal domain; aside from a possible postfix, nothing else needs to be configured.

Usage

The ingest is an Argo Events sensor with an event-source for the accessmanager-commit-event, so the ingest is triggered every time a dataset is committed via the accessmanager.
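
Schematically, such a sensor looks like the following skeleton; all names and the trigger body are illustrative, the actual definition lives in the argo/ manifests of this repository.

apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: ingest-sensor
  namespace: argo-mgmt
spec:
  dependencies:
    - name: accessmanager-commit        # the accessmanager-commit-event
      eventSourceName: accessmanager    # illustrative
      eventName: commit                 # illustrative
  triggers:
    - template:
        name: ingest
        argoWorkflow:
          operation: submit             # submit the ingest workflow on each commit
          # source: the ingest Workflow resource to submit goes here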

Contributing

See the Contribution Guide.

Changelog

See the Changelog.
