How To: DKB instance from scratch

This page provides information on the development and establishment of a new Data Knowledge Base instance in a non-ATLAS use case: from the analysis of the use-case specifics to recommendations on technical solutions.

When to use the DKB approach

  1. Metadata are Big Data in terms of the 3Vs:
    • Volume, Velocity (original storages are optimized for daily routine operations, not for analytical tasks);
    • Variety (metadata are spread across multiple sources);
  2. Requirements of analytical tasks cannot be fulfilled with the original infrastructure (or require new development for every new task).

Use-case analysis

  1. Develop the information model: concepts, relationships, attributes (see the sketch after this list).
  2. Analyze metadata usage scenarios (in terms of the model):
    • object(s) selection:
      • lookup by primary key;
      • search by pre-defined (set of) attributes;
      • search by arbitrary (set of) attributes;
      • search by links between objects (one-to-one, one-to-many, many-to-many);
    • attribute(s) values aggregation over selected objects;
    • time series aggregation;
    • (aggregated) time series analysis;
    • ...
  3. Estimate metadata volumes.
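
As an illustration of items 1 and 2, here is a minimal sketch of an information model in Python; all concept and attribute names are hypothetical, chosen to match the scenarios above:

```python
# A minimal sketch of an information model; all concept and attribute
# names here are hypothetical.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Dataset:
    dataset_id: str        # primary key: lookup by ID
    status: str            # pre-defined search attribute
    size_bytes: int        # attribute for aggregation (e.g. total size)
    updated_at: datetime   # update marker (time series, ETL, checks)


@dataclass
class Project:
    project_id: str
    name: str
    datasets: List[Dataset] = field(default_factory=list)  # one-to-many link
```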

Integrated metadata storage

  1. Choose metadata storage/query technology(-ies).
  2. Develop the storage/indexing schema(s):
    • denormalizing the information model with respect to the metadata usage scenarios (see the sketch after this list);
  3. Install.
  4. Configure (load schema(s), etc.).
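
For illustration, assuming a document-oriented search engine (such as Elasticsearch) was chosen in step 1, a denormalized schema could look like the following sketch; all field names are hypothetical:

```python
# A sketch of a denormalized document schema (Elasticsearch-style mapping),
# expressed as a Python dict. Each dataset document embeds the attributes
# of its parent project, so searching datasets by project name requires
# no join at query time.
DATASET_MAPPING = {
    "mappings": {
        "properties": {
            "dataset_id":   {"type": "keyword"},  # primary-key lookup
            "status":       {"type": "keyword"},  # pre-defined search attribute
            "size_bytes":   {"type": "long"},     # aggregations (sum, avg, ...)
            "updated_at":   {"type": "date"},     # time series / update marker
            # attributes of the parent project, copied into every dataset:
            "project_id":   {"type": "keyword"},
            "project_name": {"type": "keyword"},
        }
    }
}
```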

ETL process(es)

  1. Analyze metadata sources.
  2. Identify primary source(s) and their relationships.
  3. Develop the ETL process(es)' scheme (steps definition):
    • (E)xtraction of new/updated records from the primary source;
    • (T)ransformations:
      • extraction of related information from a secondary source;
      • calculation of derived values and surrogate keys generation;
      • format conversion;
    • (L)oad to the final storage.
  4. Install the pyDKB library (see instructions).
  5. Implement the ETL steps (independently, as standalone programs sharing only the input/output data format; see the sketch after this list).
  6. Chain the steps according to the ETL process(es) scheme(s).
  7. Schedule the ETL processes' execution.
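
A minimal sketch of what a standalone (T)ransformation step might look like, under an assumed convention (not prescribed by DKB or pyDKB) that steps exchange newline-delimited JSON records via stdin/stdout; the field names are illustrative:

```python
#!/usr/bin/env python3
"""A sketch of a standalone (T)ransformation step.

Assumed convention: steps exchange newline-delimited JSON records
via stdin/stdout, so that they can be chained with shell pipes.
"""
import json
import sys


def transform(record):
    """Calculate a derived value and generate a surrogate key
    (field names are illustrative)."""
    record["size_gb"] = record.get("size_bytes", 0) / 1024.0 ** 3
    record["_id"] = "%s:%s" % (record.get("project_id"),
                               record.get("dataset_id"))
    return record


if __name__ == "__main__":
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        print(json.dumps(transform(json.loads(line))))
```

With such a convention, step 6 becomes a shell pipeline (e.g. `extract.py | transform.py | load.py`), and step 7 can be as simple as a cron entry running that pipeline.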

Consistency checks

  1. Develop ET[L] process(es) for consistency checks:
    • steps:
      • (E)xtraction of identifiers and (minimal set of) update marker attribute(s) (timestamp, status, ...) of new/updated items:
        • make sure it can skip items updated after the latest run of the main ETL process (such items have not yet been propagated to the final storage);
      • (T)ransformation:
        • extension with values from the final storage;
        • consistency check (filter for items with inconsistent values; see the sketch after this list);
      • (L)oad to an administrator notification system (can be omitted in favour of the cron daemon's e-mail notifications);
    • chain the steps as for the main ETL process(es).
  2. Schedule checks execution.
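
For illustration, the (T) step of such a check might look like the following sketch, using the same stdin/stdout convention as above; the field names `updated_at` and `stored_updated_at` are hypothetical:

```python
#!/usr/bin/env python3
"""A sketch of the (T) step of a consistency check.

Expects each record to carry the source-side update marker
("updated_at") already extended with the value found in the final
storage ("stored_updated_at"); both names are illustrative.
"""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    record = json.loads(line)
    # Keep only inconsistent items: source and final storage disagree
    # on the update marker (e.g. the final storage missed an update).
    if record.get("updated_at") != record.get("stored_updated_at"):
        print(json.dumps(record))
```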

User interface

  1. Wrap common metadata usage scenarios into a set of parametric requests.
  2. Implement the requests (e.g. as REST API server methods; see the sketch after this list).
  3. Add a Web GUI if required (most likely some GUI(s) already exist that can be extended with the new functionality or can make use of it for better performance of existing pages).
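
A minimal sketch of item 2, using Flask as one possible implementation choice; the endpoint, its parameters and the `query_storage()` helper are hypothetical:

```python
# A sketch of a parametric request wrapped as a REST API method.
from flask import Flask, jsonify, request

app = Flask(__name__)


def query_storage(status, since):
    """Hypothetical helper: would run the corresponding search against
    the integrated metadata storage; stubbed out for illustration."""
    return [{"dataset_id": "d1", "status": status, "updated_since": since}]


@app.route("/datasets")
def datasets():
    # Scenario: search by a pre-defined set of attributes.
    status = request.args.get("status")
    since = request.args.get("since")
    return jsonify(query_storage(status, since))
```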