
ES scheme

Evildoor edited this page Mar 5, 2020 · 38 revisions

Overview

Currently, DKB uses Elasticsearch as its final storage; the mapping can be found here. A single index (name?) stores two types of documents: task and output_dataset. The following tables list the fields of these documents.

Columns:

  • Field name
  • Type - note that an Elasticsearch mapping has no dedicated list type: for example, an integer and a list of integers are both declared as "integer", and whether a field holds one value or several depends on what was written to it. Some fields are stored under multiple types; in such cases the additional types are listed in parentheses.
  • Source - the system from which the information is retrieved ("derivative" means that the field is not present in any source and is constructed from other fields; "service" means that the field is not part of the data and serves other purposes)
  • Comment
  • Value - how the field is calculated; "as-is" means the value of the field with the same name in the source
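The note on types can be illustrated with a small sketch. The mapping fragment below is hypothetical (the real mapping is the one referenced in the overview), and it assumes that an entry such as "text (keyword)" corresponds to an Elasticsearch multi-field, with a sub-field named `keyword`; that sub-field name is an assumption, not something stated on this page.

```python
# Hypothetical mapping fragment, written as a plain Python dict.
# Field names come from the tables on this page; the "keyword" sub-field
# name is an assumption about how "text (keyword)" is implemented.
MAPPING_FRAGMENT = {
    "properties": {
        "chain_data": {"type": "integer"},
        "taskname": {
            "type": "text",
            "fields": {"keyword": {"type": "keyword"}},
        },
    }
}


def declared_types(field):
    """Return the main type of a field followed by its additional types."""
    spec = MAPPING_FRAGMENT["properties"][field]
    extra = [sub["type"] for sub in spec.get("fields", {}).values()]
    return [spec["type"]] + extra


def fits_integer(value):
    """A field declared as "integer" may hold one integer or a list of them."""
    values = value if isinstance(value, list) else [value]
    return all(isinstance(v, int) for v in values)


# Both documents are compatible with the same "integer" declaration.
doc_with_scalar = {"chain_data": 7}
doc_with_list = {"chain_data": [7, 8, 9]}
```

Here `declared_types("taskname")` lists the main type first and the additional type second, which mirrors the "text (keyword)" notation used in the Type column.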

Tasks

Documents of type task represent the tasks processing ATLAS' data.

| Field name | Type | Source | Comment | Value |
| --- | --- | --- | --- | --- |
| architecture | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| campaign | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| chain_data | integer | Oracle, table ATLAS_DEFT.t_production_task | task chain is a sequence of related tasks: each task's output is used as input for the next one | list of ids of all tasks in the chain that includes this task, constructed by subquery (tasks after this one are omitted) |
| conditions_tags | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| core_count | short | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| ctag | keyword | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| description | text | Oracle, table ATLAS_DEFT.t_prodmanager_request | | as-is |
| end_time | date | Oracle, table ATLAS_DEFT.t_production_task | | source field endtime |
| energy_gev | integer | Oracle, table ATLAS_DEFT.t_prodmanager_request | | as-is |
| geometry_version | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| hashtag_list | keyword | Oracle | | String is lowercased and split into a list |
| n_events_per_job | long | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| n_files_per_job | short | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| n_files_to_be_used | integer | Oracle, table ATLAS_DEFT.t_production_task | | source field filestobeused |
| output_formats | keyword | Oracle | | String is split into a list |
| phys_group | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| pr_id | integer (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| primary_input | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| processed_events | long | Oracle, table ATLAS_PANDA.jedi_datasets | | Sum of source's neventsused corresponding to given taskid if it is not Null, total_events otherwise |
| project | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| requested_events | long | Oracle, table ATLAS_PANDA.jedi_datasets | | Sum of source's nevents corresponding to given taskid |
| run_number | integer (keyword) | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| start_time | date | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| status | keyword | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| step_id | integer | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| step_name | text (keyword) | Oracle, table ATLAS_DEFT.t_step_template | | as-is |
| subcampaign | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| task_timestamp | date | Oracle, table ATLAS_DEFT.t_production_task | | source field timestamp |
| taskid | integer (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| taskname | text (keyword) | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| ticket_id | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| total_events | long | Oracle, table ATLAS_DEFT.t_production_task | | as-is |
| trans_home | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| trans_path | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| trans_uses | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| trigger_config | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| user_name | keyword | Oracle, table ATLAS_DEFT.t_production_task | | source field username |
| vo | keyword | Oracle, table t_task | | Extracted from source field jedi_task_parameters |
| input_bytes | long | Rucio | | as-is, -1 if it is missing or error occurs |
| primary_input_deleted | boolean | Rucio | | False if input_bytes is successfully retrieved from source, True otherwise |
| primary_input_events | long | Rucio | | as-is |
| hs06 | long | Chicago ES, index tasks_archive_* (clarify this) | | source field cputime |
| toths06 | long | Chicago ES, index jobs_archive_* (clarify this) | CPU resources used by the task | Sum of source's hs06sec where jobstatus is failed or finished |
| toths06_failed | long | Chicago ES, index jobs_archive_* (clarify this) | 'wasted' CPU resources | Sum of source's hs06sec where jobstatus is failed |
| toths06_finished | long | Chicago ES, index jobs_archive_* (clarify this) | CPU resources the task would use in the perfect world | Sum of source's hs06sec where jobstatus is finished |
| chain_id | integer | Derivative | id of the chain's root (the first, initial task in it) | derived from chain_data |
| input_events | long | Derivative | | Is calculated from several other fields' values |
| phys_category | keyword | Derivative | physics category with which the task can be associated | Is determined by hashtag_list and taskname |
| _update_required | boolean | Service | marks documents that contain incomplete information about object and thus must be updated sooner or later | True if the record is incomplete and should be updated, False otherwise |
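
Some of the Value rules for task documents are easier to read as code. The following is a minimal sketch, not DKB's actual implementation: all function names are hypothetical, the comma separator for hashtag_list is an assumption (the page does not name the delimiter), and chain_id assumes that chain_data is ordered from the chain's root to the current task.

```python
from typing import Iterable, List, Mapping, Optional, Sequence


def split_hashtags(raw: Optional[str], sep: str = ",") -> List[str]:
    """hashtag_list: lowercase the source string and split it into a list.

    The separator is an assumption; adjust it to the real source format.
    """
    if not raw:
        return []
    return [tag.strip() for tag in raw.lower().split(sep) if tag.strip()]


def processed_events(nevents_used_sum: Optional[int],
                     total_events: Optional[int]) -> Optional[int]:
    """processed_events: summed neventsused if not Null, else total_events."""
    if nevents_used_sum is not None:
        return nevents_used_sum
    return total_events


def hs06_totals(jobs: Iterable[Mapping]) -> dict:
    """toths06*: sum hs06sec over the task's jobs, grouped by jobstatus.

    Jobs in any status other than failed or finished are ignored.
    """
    failed = 0
    finished = 0
    for job in jobs:
        if job["jobstatus"] == "failed":
            failed += job["hs06sec"]
        elif job["jobstatus"] == "finished":
            finished += job["hs06sec"]
    return {
        "toths06": failed + finished,
        "toths06_failed": failed,
        "toths06_finished": finished,
    }


def chain_id(chain_data: Sequence[int]) -> Optional[int]:
    """chain_id: id of the chain's root, assuming the root comes first."""
    return chain_data[0] if chain_data else None
```

Note that `processed_events` treats 0 as a valid value: only a Null (None) sum falls back to total_events.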

Output datasets

Documents of type output_dataset represent the datasets generated by the tasks while processing ATLAS' data.

| Field name | Type | Source | Comment | Value |
| --- | --- | --- | --- | --- |
| datasetname | text (keyword) | Oracle, table ATLAS_PANDA.jedi_datasets | full name of the dataset | as-is |
| bytes | long | Rucio | size of the dataset | as-is, -1 if dataset was not found in source |
| deleted | boolean | Rucio | whether the dataset was deleted from source or not | as-is, True if dataset was not found in source |
| events | long | Rucio | number of events in the dataset | as-is |
| data_format | keyword | Derivative | | extracted from datasetname |
| cross_section | double | AMI | | source field crossSection |
| cross_section_ref | keyword | AMI | | source field crossSectionRef |
| gen_filt_eff | double | AMI | | source field genFiltEff |
| k_factor | double | AMI | | source field kFactor |
| me_pdf | keyword | AMI | | source field mePDF |
| process_group | keyword | AMI | | source field processGroup |
| _update_required | boolean | Service | marks documents that contain incomplete information about object and thus must be updated sooner or later | True if the record is incomplete and should be updated, False otherwise |
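
As an illustration of the rules above, here is a minimal sketch of how an output_dataset document could be assembled. The function name and the shape of the `rucio` and `ami` inputs (plain dicts keyed by source field name, with `None` standing for "dataset not found in Rucio") are assumptions made for this example; only the field names and the fallback values come from the table.

```python
from typing import Mapping, Optional

# AMI source field -> document field, as listed in the table.
AMI_FIELD_MAP = {
    "crossSection": "cross_section",
    "crossSectionRef": "cross_section_ref",
    "genFiltEff": "gen_filt_eff",
    "kFactor": "k_factor",
    "mePDF": "me_pdf",
    "processGroup": "process_group",
}


def build_output_dataset(datasetname: str,
                         rucio: Optional[Mapping],
                         ami: Mapping) -> dict:
    """Assemble an output_dataset document from hypothetical source records."""
    doc = {"datasetname": datasetname}
    if rucio is None:
        # Dataset was not found in Rucio: use the documented fallbacks.
        doc["bytes"] = -1
        doc["deleted"] = True
    else:
        # "as-is": copy the fields with the same names from the source.
        doc["bytes"] = rucio["bytes"]
        doc["deleted"] = rucio["deleted"]
        if "events" in rucio:
            doc["events"] = rucio["events"]
    # Rename the AMI fields that are present; missing ones are skipped.
    for source_name, field_name in AMI_FIELD_MAP.items():
        if source_name in ami:
            doc[field_name] = ami[source_name]
    return doc
```

The data_format field is left out on purpose: the page only says it is extracted from datasetname, without describing how.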