ETL: glossary

Marina Golosova edited this page Oct 21, 2020 · 2 revisions
Control symbol
ASCII symbol used for flow control (see Internal communication protocol specification).
Data flow
Sequence of messages produced by the E-stage and passed through all T-stages to the L-stage.
Dataflow topology
Shape of the data flow from the main source to the final storage.
ETL module, stage
Logical unit of the ETL process that implements a single operation on the data flow (extraction, transformation, or load). Consists of a supervisor and a worker.
ETL process, Dataflow
Combination of independent modules, or stages, orchestrated into a dataflow topology that connects the main source and the final storage. Its purpose is to monitor the main source for new or updated data and push it into the final storage in a transformed form, prepared for further use.
E-, T-, L-stage
ETL stage responsible for the corresponding ETL operation (extraction, transformation, or load).
Final storage
Storage, indexing, and access system for the integrated representation of objects of interest produced by the ETL process.
Main (primary) source
One of the original metadata sources, containing an update mark attribute of the object(s) of interest (preferably in a queryable form).
Message
Metadata of an object of interest in a predefined format (e.g. JSON), possibly containing service fields and ending with the control symbol EOM (end-of-message).
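The framing described in the Message entry can be illustrated with a minimal sketch. The actual EOM value is defined in the Internal communication protocol specification; the ASCII Record Separator used below, as well as the function names and sample fields, are assumptions made only for this example.

```python
import json

# Assumption: the real EOM value comes from the protocol specification;
# the ASCII Record Separator is used here purely for illustration.
EOM = "\x1e"


def encode_message(metadata: dict) -> str:
    """Serialize one object's metadata as JSON and terminate it with EOM."""
    # json.dumps escapes control characters, so EOM cannot occur in the body.
    return json.dumps(metadata, sort_keys=True) + EOM


def decode_stream(stream: str) -> list:
    """Split a stream into messages, ignoring an incomplete trailing chunk."""
    chunks = stream.split(EOM)
    # The last chunk has no terminating EOM, so it is not a complete message.
    return [json.loads(chunk) for chunk in chunks[:-1]]


stream = encode_message({"id": "obj-1", "title": "first"}) + encode_message({"id": "obj-2"})
messages = decode_stream(stream)
```

A reader of the flow treats everything up to the next EOM as one message, so a partially written message at the end of the stream is simply left for the next read.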
Secondary source
One of the original metadata sources, used to get additional information about the object of interest by the object's ID.
Supervisor
Stage component responsible for data flow control: reading from and writing to the flow, marking data in the flow as processed, etc. Can have the same implementation for all stages of the same type (E-, T-, or L-type stages).
Topology description
List of the ETL stages, their start instructions, and the linking scheme (which stage's output is which stage's input).
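One possible shape for such a description is sketched below. The dictionary layout, stage names, and commands are hypothetical and do not reflect the project's actual format; the sketch only shows how the three ingredients (stages, start instructions, linking scheme) fit together and how a start order can be derived from the links.

```python
import shlex

# Hypothetical topology: stage names, commands, and keys are illustrative only.
TOPOLOGY = {
    "extract": {"type": "E", "start": "python extract.py --source main", "inputs": []},
    "convert": {"type": "T", "start": "python convert.py", "inputs": ["extract"]},
    "load": {"type": "L", "start": "python load.py --target storage", "inputs": ["convert"]},
}


def stage_order(topology: dict) -> list:
    """Return stage names so that every stage follows the stages feeding it."""
    ordered, done = [], set()
    pending = dict(topology)
    while pending:
        # A stage is ready once all stages linked to its input are placed.
        ready = [name for name, spec in pending.items()
                 if all(src in done for src in spec["inputs"])]
        if not ready:
            raise ValueError("linking scheme has a cycle or an unknown input")
        for name in ready:
            ordered.append(name)
            done.add(name)
            del pending[name]
    return ordered


order = stage_order(TOPOLOGY)
# A start instruction given as a command line can be split into an argv list.
argv = shlex.split(TOPOLOGY["convert"]["start"])
```

Deriving the order from the links, rather than storing it separately, keeps the linking scheme as the single source of truth for the shape of the data flow.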
Worker
Stage component responsible for case-specific operations on data: querying an external metadata source (main or secondary), transforming data, or pushing the processing results to the final storage. Usually has an individual implementation for each case-specific operation, but some workers can be reused in similar situations (e.g. for format conversion stages).
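The division of labour between supervisor and worker can be sketched as follows. This is an in-process simplification under stated assumptions: the function names and the `_processed` service field are hypothetical, and the glossary does not specify how the two components actually communicate.

```python
from typing import Callable, Iterable, Iterator


def uppercase_title(message: dict) -> dict:
    """Worker (hypothetical): a case-specific transformation of one message."""
    return {**message, "title": message.get("title", "").upper()}


def supervise(flow: Iterable[dict], worker: Callable[[dict], dict]) -> Iterator[dict]:
    """Supervisor (hypothetical): read from the flow, delegate, write downstream."""
    for message in flow:
        result = worker(message)
        # Assumption: "marked as processed" is modelled here by a service field.
        yield {**result, "_processed": True}


incoming = [{"id": "obj-1", "title": "first"}, {"id": "obj-2"}]
outgoing = list(supervise(incoming, uppercase_title))
```

Because `supervise` knows nothing about what the worker does, the same supervisor can wrap any worker of the same stage type, which matches the reuse described in the Supervisor entry.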
Worker run instructions
Instructions on how to run a given worker instance (e.g. a command line).