ES scheme

Evildoor edited this page Feb 28, 2020 · 38 revisions

Overview

Currently, DKB uses Elasticsearch as its final storage; the mapping can be found here. A single index (name?) stores two types of documents: task and output_dataset. The following tables list the fields of these documents.

Columns:

  • Field name
  • Type - note that Elasticsearch's mapping has no special definition for lists: for example, an integer and a list of integers are both defined as "integer", and whether a field holds a single value or a list depends on what was put into it
  • Source - the system from which the information is retrieved ("derivative" means that the field is not present in any source and is constructed from other fields; "service" means that the field is not part of the data and serves other purposes)
  • Comment
  • Value - how the field is calculated; "as-is" means the value of the field with the same name in the source
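
To illustrate the note on types, here is a minimal sketch. It assumes a mapping fragment for two fields from the Tasks table (taskid and chain_data, both typed as integer); the document values are made up for illustration, and the helper as_list is hypothetical, not part of DKB.

```python
# Fragment of an Elasticsearch mapping: both fields are declared as "integer",
# although one is meant to hold a single value and the other a list.
mapping_fragment = {
    "properties": {
        "taskid": {"type": "integer"},
        "chain_data": {"type": "integer"},
    }
}

# Illustrative document (values are made up).
document = {
    "taskid": 3,              # single integer
    "chain_data": [1, 2, 3],  # list of integers, same mapping type
}


def as_list(value):
    """Hypothetical helper: return the field value as a list.

    The mapping cannot tell a scalar from a list, so readers of the index
    may want to normalize values before iterating over them.
    """
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]
```

With such a helper, as_list(document["taskid"]) and as_list(document["chain_data"]) can be handled by the same code path.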

Tasks

Documents of type task represent tasks that process ATLAS data.

| Field name | Type | Source | Comment | Value |
|---|---|---|---|---|
| architecture | keyword | Oracle | | |
| campaign | text | Oracle | | |
| chain_data | integer | Oracle, table ATLAS_DEFT.t_production_task | A task chain is a sequence of related tasks: each task's output is used as input for the next one | List of ids of all tasks in the chain that includes this task, constructed by a subquery (tasks after this one are omitted) |
| core_count | short | Oracle | | |
| ctag | keyword | Oracle | | |
| description | text | Oracle | | |
| end_time | date | Oracle | | |
| energy_gev | integer | Oracle | | |
| geometry_version | keyword | Oracle | | |
| hashtag_list | keyword | Oracle | | String is lowercased and split into a list |
| n_events_per_job | long | Oracle | | |
| n_files_per_job | short | Oracle | | |
| n_files_to_be_used | integer | Oracle | | |
| output_formats | keyword | Oracle | | String is split into a list |
| phys_group | text | Oracle | | |
| pr_id | integer | Oracle | | |
| primary_input | text | Oracle | | |
| processed_events | long | Oracle | | |
| project | text | Oracle | | |
| requested_events | long | Oracle | | |
| run_number | integer | Oracle | | |
| start_time | date | Oracle | | |
| status | keyword | Oracle | | |
| step_id | integer | Oracle | | |
| step_name | text | Oracle | | |
| subcampaign | text | Oracle | | |
| task_timestamp | date | Oracle | | |
| taskid | integer | Oracle | | |
| taskname | text | Oracle | | |
| ticket_id | keyword | Oracle | | |
| total_events | long | Oracle | | |
| trans_home | keyword | Oracle | | |
| trans_path | keyword | Oracle | | |
| trans_uses | keyword | Oracle | | |
| trigger_config | keyword | Oracle | | |
| user_name | keyword | Oracle | | |
| vo | keyword | Oracle | | |
| input_bytes | long | Rucio | | As-is; -1 if it is missing or an error occurs |
| primary_input_deleted | boolean | Rucio | | False if input_bytes is successfully retrieved from the source, True otherwise |
| primary_input_events | long | Rucio | | As-is |
| hs06 | long | Chicago ES | | |
| toths06 | long | Chicago ES | CPU resources used by the task | |
| toths06_failed | long | Chicago ES | "Wasted" CPU resources | |
| toths06_finished | long | Chicago ES | CPU resources the task would use in a perfect world | |
| chain_id | integer | Derivative | Id of the chain's root (the first, initial task in it) | Derived from chain_data |
| input_events | long | Derivative | | |
| phys_category | keyword | Derivative | | Determined by hashtag_list and taskname |
| _update_required | boolean | Service | Marks documents that contain incomplete information about the object and thus must be updated sooner or later | True if the record is incomplete and should be updated, False otherwise |
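
Some of the Value rules above can be sketched in code. This is only an illustration under stated assumptions, not DKB's actual implementation: the function names are hypothetical, the hashtag separator (a comma) is assumed because the table does not specify it, chain_data is assumed to list task ids root first, and a failed or empty Rucio lookup is modelled as None.

```python
def split_hashtags(raw, sep=","):
    """hashtag_list: lowercase the source string and split it into a list.

    The separator is an assumption; the table does not say which one is used.
    """
    if not raw:
        return []
    return [tag.strip() for tag in raw.lower().split(sep) if tag.strip()]


def chain_id_from(chain_data):
    """chain_id: id of the chain's root, derived from chain_data.

    Assumes chain_data is ordered with the root task first.
    """
    return chain_data[0] if chain_data else None


def rucio_fields(input_bytes):
    """input_bytes and primary_input_deleted from a Rucio lookup result.

    None stands for "value missing or an error occurred" (an assumption
    about how the lookup reports failure).
    """
    if input_bytes is None:
        return {"input_bytes": -1, "primary_input_deleted": True}
    return {"input_bytes": input_bytes, "primary_input_deleted": False}


def update_required_query():
    """Elasticsearch query body selecting documents marked as incomplete."""
    return {"query": {"term": {"_update_required": True}}}
```

The last function only builds a query body using the standard term query; how DKB actually selects documents for an update is not described on this page.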

Output datasets

TBD