Database Structure
SModelS stores all the information about the experimental results in the Database <Database>
. Below we describe the directory <folderStruct>
and object <objStruct>
structure of the Database <Database>
and how the information stored in the database is used within SModelS <interpolationDB>
.
The Database <Database>
is organized as files in an ordinary (UNIX) directory hierarchy, with a thin Python layer providing access to the database. The overall structure of the directory hierarchy and its contents is depicted in the scheme below:
As seen above, the top level of the SModelS database categorizes the analyses by LHC center-of-mass energy:
- 8 TeV
- 13 TeV
Also, the top level directory contains a file called version
with the version string of the database. The second level splits the results up between the different experiments:
- 8TeV/CMS/
- 8TeV/ATLAS/
The third level of the directory hierarchy encodes the Experimental Results <ExpResult>
:
- 8TeV/CMS/CMS-SUS-12-024
- 8TeV/ATLAS/ATLAS-CONF-2013-047
- ...
- The Database folder is described by the Database Class
Each Experimental Result <ExpResult>
folder contains:
- a folder for each DataSet <DataSet> (e.g. data)
- a globalInfo.txt file
The globalInfo.txt
file contains the meta information about the Experimental Result <ExpResult>
. It defines the center-of-mass energy
/literals/globalInfo.txt
- Experimental Result folder is described by the ExpResult Class
- globalInfo files are described by the Info Class
Each DataSet<DataSet>
folder (e.g. data
) contains:
- the Upper Limit maps for UL-type results <ULtype> or Efficiency maps for EM-type results <EMtype> (TxName.txt files)
- a dataInfo.txt file containing meta information about the DataSet <DataSet>
- Data Set folders are described by the DataSet Class
- TxName files are described by the TxName Class
- dataInfo files are described by the Info Class
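The folder layout described so far can be explored with a few lines of standard-library Python. The sketch below is only an illustration and not the SModelS access layer: the helper names `list_results` and `make_demo_tree` are hypothetical, and the demo tree contains empty placeholder files rather than real database content.

```python
from pathlib import Path
import tempfile


def list_results(db_root):
    """Yield (sqrts, experiment, result id, dataset names) per result folder."""
    root = Path(db_root)
    for sqrts_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for exp_dir in sorted(p for p in sqrts_dir.iterdir() if p.is_dir()):
            for res_dir in sorted(p for p in exp_dir.iterdir() if p.is_dir()):
                # Only folders holding a globalInfo.txt count as results.
                if not (res_dir / "globalInfo.txt").is_file():
                    continue
                # A dataset folder is recognized by its dataInfo.txt file.
                datasets = sorted(
                    d.name for d in res_dir.iterdir()
                    if d.is_dir() and (d / "dataInfo.txt").is_file()
                )
                yield sqrts_dir.name, exp_dir.name, res_dir.name, datasets


def make_demo_tree(root):
    """Build a tiny mock database with empty placeholder files."""
    root = Path(root)
    (root / "version").write_text("demo\n")
    data = root / "8TeV" / "CMS" / "CMS-SUS-12-024" / "data"
    data.mkdir(parents=True)
    (data.parent / "globalInfo.txt").write_text("")
    (data / "dataInfo.txt").write_text("")
    (data / "T1tttt.txt").write_text("")
    return root


with tempfile.TemporaryDirectory() as tmp:
    demo_results = list(list_results(make_demo_tree(tmp)))
print(demo_results)
```

The walk relies only on the conventions stated above: energy, experiment and result folders at the first three levels, a globalInfo.txt per result and a dataInfo.txt per dataset.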
Since UL-type results <ULtype>
have a single dataset (see DataSets<DataSet>
), the info file only holds some trivial information, such as the type of Experimental Result <ExpResult>
(UL) and the dataset id (None for UL-type results). Here is the content of CMS-SUS-12-024/data/dataInfo.txt as an example:
/literals/dataInfo.txt
For EM-type results <EMtype>
the dataInfo.txt
contains relevant information, such as an id to identify the DataSet<DataSet>
(signal region), the number of observed and expected background events for the corresponding signal region and the respective signal upper limits. Here is the content of CMS-SUS-13-012-eff/3NJet6_1000HT1250_200MHT300/dataInfo.txt as an example:
/literals/dataInfo-eff.txt
Each DataSet<DataSet>
contains one or more TxName.txt
files storing the bulk of the experimental result data. For UL-type results <ULtype>
, the TxName file contains the UL maps for a given simplified model (element <element>
or sum of elements <element>
), while for EM-type results <EMtype>
the file contains the simplified model efficiencies. In addition, the TxName files also store some meta information, such as the source of the data and the type of result (prompt or displaced). If not specified, the type will be assumed to be prompt [1]. For instance, the first few lines of CMS-SUS-12-024/data/T1tttt.txt read:
/literals/T1tttt.txt
As seen above, the first block of data in the file contains information about the element <element>
or simplified model ([[['t','t']],[['t','t']]]) in bracket notation <bracketNotation>
to which the data refer, as well as a reference to the original data source and some additional information. The simplified model is assumed to contain neutral BSM final states (MET signature) and arbitrary ( Z2-odd particles <particleClass>
) intermediate BSM states. If the experimental result refers to non-MET final states, the finalState field must list the type of BSM particles <particleClass>
(see UL-type <ULtype>
for more details). An example from the CMS-EXO-12-026/data/THSCPM1b.txt file is shown below:
/literals/THSCPM1b.txt
In addition, if specific BSM intermediate states are required, the intermediateState field must include a nested list (one for each branch) specifying the labels of the intermediate BSM states [2]. If the intermediateState field is specified, the corresponding result will only be applied to simplified models containing intermediate BSM particles with the same quantum numbers. One example is shown below:
/literals/TxIntermediate.txt
The second block of data in the TxName.txt
file contains the upper limits or efficiencies as a function of the relevant simplified model parameters:
/literals/T1tttt.txt
As we can see, the data grid is given as a Python array with the structure: [[masses, upper limit], [masses, upper limit], ...]. For prompt analyses, the relevant parameters are usually the BSM masses, since all decays are assumed to be prompt. On the other hand, results for long-lived or meta-stable particles may depend on the BSM widths as well. The width dependence can be easily included through the following generalization:
[[M1, M2...], [MA, MB, ...]] → [[(M1, Γ1), (M2, Γ2)...], [(MA, ΓA), (MB, ΓB), ...]]
In order to make the notation more compact, whenever the width dependence is not included, the corresponding decay will be assumed to be prompt and an effective lifetime reweighting factor <dbReweighting>
will be applied to the upper limits. For instance, a mixed type data grid is also allowed:
[ [[M1, (M2, Γ2)], [M1, (M2, Γ2)]], UL ], [ [[M1′, (M2′, Γ2′)], [M1′, (M2′, Γ2′)]], UL′ ], ...
The example above represents a simplified model where the decay of the mother is prompt, while the daughter does not have to be stable, hence the dependence on Γ2. In this case, the lifetime reweighting factor <dbReweighting>
is applied only for the mother decay.
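The mixed notation above can be read mechanically: a bare number is a mass whose decay is treated as prompt, while a (mass, width) tuple carries an explicit width. The following sketch separates the two. The function name `split_masses_widths` is hypothetical, the numerical values are illustrative only, and units are omitted.

```python
def split_masses_widths(mass_array):
    """Split a mixed mass array into parallel mass and width arrays.

    Entries given as (mass, width) tuples carry an explicit width; bare
    numbers have no width in the grid and are marked with None.
    """
    masses, widths = [], []
    for branch in mass_array:
        branch_masses, branch_widths = [], []
        for entry in branch:
            if isinstance(entry, tuple):
                mass, width = entry
            else:
                mass, width = entry, None
            branch_masses.append(mass)
            branch_widths.append(width)
        masses.append(branch_masses)
        widths.append(branch_widths)
    return masses, widths


# One grid point has the form [mass array, upper limit].
grid_point = [[[500.0, (200.0, 1e-15)], [500.0, (200.0, 1e-15)]], 0.12]
masses, widths = split_masses_widths(grid_point[0])
print(masses)
print(widths)
```

Positions marked with None are the ones for which the lifetime reweighting would have to be applied, here the mother in each branch.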
If the analysis signal efficiencies or upper limits are insensitive to some of the simplified model final states, it might be convenient to define inclusive simplified models. A typical case is that of some heavy stable charged particle searches, which only rely on the presence of a non-relativistic charged particle, leading to an anomalous charged track signature. In this case the signal efficiencies are highly insensitive to the remaining event activity and the corresponding simplified models can be very inclusive. In order to handle these inclusive cases in the database we allow for wildcards when specifying the constraints. For instance, the constraint for the CMS-EXO-13-006 eff/c000/THSCPM3.txt reads:
/literals/THSCPM3.txt
and represents the (inclusive) simplified model:
Note that although the final state represented by "*" is any Z2-even particle <particleClass>
, it must still correspond to a single particle, since the topology specifies a 2-body decay for the initially produced BSM particle. Finally, it might be useful to define even more inclusive simplified models, such as the one in CMS-EXO-13-006 eff/c000/THSCPM4.txt:
/literals/THSCPM4.txt
In the above case the simplified model corresponds to an HSCP being initially produced in association with any BSM particle which leads to a MET signature. Note that "[*]" corresponds to any branch, while ["*"] means any particle:
In such cases the mass array for the arbitrary branch must also be specified using wildcards:
/literals/THSCPM4.txt
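The two kinds of wildcard can be illustrated with a small matcher over nested lists in bracket notation (branches, then vertices, then particle labels). This is a simplified sketch and not the SModelS matching code. It assumes a Python encoding in which the string "*" standing in place of a whole branch plays the role of "[*]", while "*" inside a vertex stands for any single particle. It compares branches and particles positionally and ignores the ordering freedom a full implementation would handle. All function names are hypothetical.

```python
WILDCARD = "*"


def vertex_matches(pattern, vertex):
    """A '*' particle matches exactly one particle; others must be equal."""
    if len(pattern) != len(vertex):
        return False
    return all(p == WILDCARD or p == q for p, q in zip(pattern, vertex))


def branch_matches(pattern, branch):
    """A '*' branch matches any branch; otherwise compare vertex by vertex."""
    if pattern == WILDCARD:
        return True
    if len(pattern) != len(branch):
        return False
    return all(vertex_matches(pv, v) for pv, v in zip(pattern, branch))


def element_matches(pattern, element):
    """Compare two elements branch by branch (positional comparison only)."""
    if len(pattern) != len(element):
        return False
    return all(branch_matches(pb, b) for pb, b in zip(pattern, element))


# Any single particle in each one-step branch.
any_particle = [[["*"]], [["*"]]]
print(element_matches(any_particle, [[["W"]], [["Z"]]]))
print(element_matches(any_particle, [[["W", "Z"]], [["Z"]]]))

# Any first branch, produced together with a branch without decays.
any_branch = ["*", []]
print(element_matches(any_branch, [[["t"], ["W"]], []]))
```

The second call fails because a particle wildcard still stands for exactly one particle, in line with the 2-body decay remark above.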
The Database folder structure <folderStruct>
is mapped to Python objects in SModelS. The mapping is almost one-to-one, with a few exceptions. Below we show the overall object structure as well as the folders/files the objects represent:
The type of Python object (Python class, Python list,...) is shown in brackets. For convenience, below we explicitly list the main database folders/files and the Python objects they are mapped to:
- Database <Database> folder → Database Class
- Experimental Result <ExpResult> folder → ExpResult Class
- DataSet <DataSet> folder → DataSet Class
- globalInfo.txt file → Info Class
- dataInfo.txt file → Info Class
- TxName.txt file → TxName Class
The first time the Database class is instantiated, the text files in <database-path> are loaded and parsed, and the corresponding data objects are built. The efficiency and upper limit maps themselves are subjected to standard preprocessing steps such as a principal component analysis and Delaunay triangulation (see below <interpolationDB>
). For the sake of efficiency, the entire database (including the Delaunay triangulation) is then serialized into a pickle file (<database-path>/database.pcl), which will be read directly the next time the database is loaded. If any changes in the database folder structure are detected, or if the Python or SModelS version has changed, SModelS will automatically rebuild the pickle file. This action may take a few minutes, but it is again performed only once. If desired, the pickling process can be skipped using the option force_load = 'txt' in the constructor of Database.
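The caching strategy described above (parse once, pickle, rebuild when something changed) can be sketched with the standard library alone. This is a generic illustration and not the SModelS implementation: `parse_text_files`, `fingerprint` and `load_database` are hypothetical names, the staleness check based on file modification times and the Python version is an assumption, and the demo file holds placeholder text.

```python
import pickle
import sys
import tempfile
from pathlib import Path

CACHE_NAME = "database.pcl"


def parse_text_files(db_path):
    """Stand-in for the real parser: map each .txt file to its raw text."""
    return {str(p.relative_to(db_path)): p.read_text()
            for p in sorted(Path(db_path).rglob("*.txt"))}


def fingerprint(db_path):
    """Metadata used to decide whether an existing cache is stale."""
    files = sorted(Path(db_path).rglob("*.txt"))
    return {
        "python": sys.version_info[:3],
        "files": [(str(p.relative_to(db_path)), p.stat().st_mtime_ns)
                  for p in files],
    }


def load_database(db_path, force_load=None):
    """Return (data, source), where source is 'txt' or 'pcl'."""
    cache = Path(db_path) / CACHE_NAME
    meta = fingerprint(db_path)
    if force_load != "txt" and cache.is_file():
        with cache.open("rb") as fh:
            stored = pickle.load(fh)
        if stored["meta"] == meta:
            return stored["data"], "pcl"
    data = parse_text_files(db_path)
    if force_load != "txt":
        # Write the cache so that the next load can skip the parsing step.
        with cache.open("wb") as fh:
            pickle.dump({"meta": meta, "data": data}, fh)
    return data, "txt"


with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "8TeV" / "CMS" / "CMS-SUS-12-024" / "globalInfo.txt"
    info.parent.mkdir(parents=True)
    info.write_text("placeholder\n")
    first = load_database(tmp)[1]    # parsed from text, cache written
    second = load_database(tmp)[1]   # read back from the cache
    forced = load_database(tmp, force_load="txt")[1]  # cache bypassed
print(first, second, forced)
```

In this sketch, passing force_load="txt" both ignores an existing cache and skips writing a new one, mirroring the option described above.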
- The pickle file is created by the createBinaryFile method
All the information contained in the database files <folderStruct>
is stored in the database objects <objStruct>
. Within SModelS the information in the Database <Database>
is mostly used for constraining the simplified models generated by the decomposition <decomposition>
of the input model. Each simplified model (or element <element>
) generated is compared to the simplified models constrained by the database and specified by the constraint and finalStates entries in the TxName files <txnameFile>
. This comparison identifies which results can be used to test the input model. Once a matching result is found, the upper limit or efficiency must be computed for the given input element <element>
. As described above <txnameFile>
, the upper limits or efficiencies are provided as function of masses and widths in the form of a discrete grid. In order to compute values for any given input element <element>
, the data has to be processed as described below.
The efficiency and upper limit maps are subjected to a few standard preprocessing steps. First all the units are removed, the shape of the grid is stored and the relevant width dependence is identified (see discussion above <txnameFile>
). Then the masses and widths are transformed into a flat array:
[[M1, (M2, Γ2)], [MA, (MB, ΓB)]] → [M1, M2, MA, MB, log (1 + Γ2), log (1 + ΓB)]
Finally, a principal component analysis and a Delaunay triangulation (see figure below <delaunay>
) are applied over the new coordinates. The simplices defined during triangulation are then used for linearly interpolating the transformed data grid, thus allowing SModelS to compute efficiencies or upper limits for arbitrary mass and width values (as long as they fall inside the data grid). As seen above, the logarithm of the width parameters is taken before interpolation, which effectively corresponds to an exponential interpolation. If the data grid does not explicitly provide a dependence on all the widths (as in the example above <dataTransf>
), the computed upper limit or efficiency is then reweighted imposing the requirement of prompt decays (see lifetime reweighting <dbReweighting>
for more details). This procedure provides an efficient and numerically robust way of dealing with generic data grids, including arbitrary parametrizations of the mass parameter space, irregular data grids and asymmetric branches.
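The flattening step that precedes the triangulation can be written down directly from the transformation shown above. The sketch below is a minimal illustration: `flatten_point` is a hypothetical name, units are assumed to have been removed already, and the numbers are illustrative. It uses `math.log1p`, which evaluates log(1 + Γ) accurately for very small Γ; whether SModelS does the same is not stated here.

```python
import math


def flatten_point(mass_array):
    """Map [[M1, (M2, G2)], [MA, (MB, GB)]] to
    [M1, M2, MA, MB, log(1 + G2), log(1 + GB)]."""
    masses, log_widths = [], []
    for branch in mass_array:
        for entry in branch:
            if isinstance(entry, tuple):
                mass, width = entry
                # Widths enter the interpolation logarithmically.
                log_widths.append(math.log1p(width))
            else:
                mass = entry
            masses.append(mass)
    # All masses first, followed by the transformed widths.
    return masses + log_widths


point = [[500.0, (200.0, 1e-15)], [400.0, (100.0, 2e-15)]]
flat = flatten_point(point)
print(flat)
```

Entries without an explicit width contribute only a mass coordinate, so the dimension of the flat array depends on how many widths the grid provides.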
From v2.0 onwards, SModelS allows width-dependent efficiencies and upper limits to be included. However, most experimental results do not provide upper limits (or efficiencies) as a function of the BSM particles' widths, since usually all the decays are assumed to be prompt and the last BSM particle appearing in the cascade decay is assumed to be stable [3]. In order to apply these results to models which may contain meta-stable particles, it is possible to approximate the dependence on the widths for the case in which the experimental result requires all BSM decays to be prompt and the last BSM particle to be stable or decay outside the detector. In SModelS this is done through a reweighting factor which corresponds to the fraction of prompt decays (for intermediate states) and decays outside the detector (for final BSM states) for a given set of widths. For instance, assume an EM-type result <EMtype>
only provides efficiencies (ϵprompt) for prompt decays:
Then, for other values of the widths, an effective efficiency (ϵeff) can be approximated by:
ϵeff = ξ × ϵprompt , where ξ = ℱprompt(ΓX1) × ℱprompt(ΓX2) × ℱlong(ΓY1) × ℱlong(ΓY2)
In the expression above ℱprompt(Γ) is the probability for the decay to be prompt given a width Γ and ℱlong(Γ) is the probability for the decay to take place outside the detector. The precise values of ℱprompt and ℱlong depend on the relevant detector size (L), particle mass (M), boost (β) and width (Γ), thus requiring a Monte Carlo simulation for each input model. Since this is not within the spirit of the simplified model approach, we approximate the prompt and long-lived probabilities by:
where Louter is the approximate size of the detector (which we take to be 10 m for both ATLAS and CMS), Linner is the approximate radius of the inner detector (which we take to be 1 mm for both ATLAS and CMS). Finally, we take the effective time dilation factor to be ⟨γβ⟩ = 1.3 when computing ℱprompt and ⟨γβ⟩ = 1.43 when computing ℱlong. We point out that the above approximations are irrelevant if Γ is very large (ℱprompt ≃ 1 and ℱlong ≃ 0) or close to zero (ℱprompt ≃ 0 and ℱlong ≃ 1). Only elements containing particles which have a considerable fraction of displaced decays will be sensitive to the values chosen above. Also, a precise treatment of lifetimes is possible if the experimental result (or a theory group) explicitly provides the efficiencies as a function of the widths, as discussed above <widthGrid>
.
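The reweighting described above can be sketched numerically. The functional forms used below are an assumption based on the standard exponential decay law: ℱlong = exp(−Γ Louter / ⟨γβ⟩) and ℱprompt = 1 − exp(−Γ Linner / ⟨γβ⟩), with the width converted to an inverse length through ħc. They reproduce the limits quoted in the text for very large and vanishing Γ. The detector sizes and ⟨γβ⟩ values are those given above; the function names and the assumption that widths are given in GeV are hypothetical.

```python
import math

HBAR_C_GEV_M = 1.973269804e-16  # hbar*c in GeV*m (width in GeV -> 1/m)

L_INNER_M = 1.0e-3   # approximate inner detector radius (1 mm)
L_OUTER_M = 10.0     # approximate detector size (10 m)
GB_PROMPT = 1.3      # <gamma*beta> used for the prompt probability
GB_LONG = 1.43       # <gamma*beta> used for the long-lived probability


def f_long(width_gev):
    """Probability to decay beyond L_OUTER_M (assumed exponential law)."""
    return math.exp(-width_gev * L_OUTER_M / (GB_LONG * HBAR_C_GEV_M))


def f_prompt(width_gev):
    """Probability to decay within L_INNER_M (assumed exponential law)."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    return -math.expm1(-width_gev * L_INNER_M / (GB_PROMPT * HBAR_C_GEV_M))


def xi(intermediate_widths, final_widths):
    """Prompt fractions for intermediate states times the long-lived
    fractions for the final BSM states."""
    return (math.prod(f_prompt(w) for w in intermediate_widths)
            * math.prod(f_long(w) for w in final_widths))


# Illustrative numbers: two intermediate states and two stable final states.
eps_prompt = 0.2
factor = xi([1e-13, 1.0], [0.0, 0.0])
eps_eff = factor * eps_prompt
print(factor, eps_eff)
```

For widths around 1e-13 GeV the prompt fraction is neither 0 nor 1, which is the regime where the approximations chosen above actually matter.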
The above expressions allow the generalization of the efficiencies computed assuming prompt decays to models with meta-stable particles. For UL-type results <ULtype>
the same arguments apply with one important distinction. While efficiencies are reduced for displaced decays (ξ < 1), upper limits are enhanced, since they are roughly inversely proportional to signal efficiencies. Therefore, for UL-type results <ULtype>
, we have:
σeffUL = σpromptUL/ξ
Finally, we point out that for the experimental results which provide efficiencies or upper limits as a function of some (but not all) BSM widths appearing in the simplified model (see the discussion above <widthGrid>
), the reweighting factor ξ is computed using only the widths not present in the grid.
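The last two points can be combined in a short sketch: ξ is built only from particles whose width is not an axis of the grid, and a prompt upper limit is divided by ξ. The per-particle factors are taken as given inputs here. The names `reweight_factor` and `effective_upper_limit`, the particle labels, and the choice to return infinity when ξ vanishes are all assumptions made for illustration.

```python
import math


def reweight_factor(factors_by_particle, widths_in_grid):
    """Multiply only the factors of particles whose width is not already
    an explicit coordinate of the data grid."""
    return math.prod(
        factor for name, factor in factors_by_particle.items()
        if name not in widths_in_grid
    )


def effective_upper_limit(ul_prompt, xi):
    """Rescale a prompt upper limit; a vanishing xi means no constraint."""
    if xi <= 0.0:
        return math.inf
    return ul_prompt / xi


# Illustrative values: prompt fraction for X1, long-lived fraction for Y1.
factors = {"X1": 0.5, "Y1": 0.8}

# The grid already depends on the width of Y1, so only X1 is reweighted.
xi_partial = reweight_factor(factors, widths_in_grid={"Y1"})
ul_eff = effective_upper_limit(0.1, xi_partial)
print(xi_partial, ul_eff)
```

Since ξ is at most 1, the effective upper limit is never smaller than the prompt one, matching the statement that upper limits are enhanced for displaced decays.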
[1] Prompt results are all those which assume all decays to be prompt and the last BSM particle to be stable (or decay outside the detector). Searches for heavy stable charged particles (HSCPs), for instance, are classified as prompt, since the HSCP is assumed to decay outside the detector. Displaced results, on the other hand, require at least one decay to take place inside the detector.
[2] Although the finalState and intermediateState fields could be combined into a single entry, they are kept separate for backward compatibility.
[3] An obvious exception is searches for long-lived particles with displaced decays.