# What is ServiceX?

<img src="img/logo_servicex.png" width=150 height=150 />

ServiceX is a <span style="color:red">scalable</span> <span style="color:blue">HEP event data</span> <span style="color:purple">extraction</span>, <span style="color:orange">transformation</span> and <span style="color:green">delivery</span> system
- <span style="color:blue"> HEP event data</span>: supports various input data formats - ROOT Ntuple (CMS nanoAOD), ATLAS xAOD, future data formats
- <span style="color:purple"> Extraction</span>: user-selected column(s) with filtering
- <span style="color:orange"> Transformation</span>: transform into various formats - Awkward arrays, Apache Parquet, ROOT Ntuple
- <span style="color:green"> Delivery </span>: on-demand delivery to a user or streaming into an analysis system, with remote input data accessed via Rucio or XRootD
- <span style="color:red">Scalable</span>: runs on any Kubernetes cluster, scales up workers if necessary

<br>
<br>

### Example ServiceX workflow

<img src="img/ServiceX_workflow_2.png" width=800 />

1. A user makes a ServiceX delivery request from a Jupyter notebook via a REST interface
1. The ServiceX backend looks up the input datasets and retrieves an input file list
1. Transformer code is generated based on the input data format, the func-adl query, and so on
1. Transformer pods (workers) are created to process each file (10 pods at first, scaling up if necessary)
1. Outputs are streamed into the object store inside the Kubernetes cluster
1. The user downloads the outputs asynchronously

<br>
<br>

### Where is ServiceX?

- ServiceX is deployed on a Kubernetes cluster
    - Needs enough resources to scale pods
    - Preferably co-located with a data center for high network bandwidth
- Types
    - Stand-alone: Secured by its own authentication system. The web API is accessible from anywhere.
    - Integrated into coffea-casa: Secured by the CERN authentication system. Only accessible from inside coffea-casa.
- Input data format
    - A dedicated ServiceX deployment for each input data format: ROOT ntuple, ATLAS xAOD, CMS Run-1 AOD
    - A single deployment for all input data types is currently under development
- Available ServiceX endpoints

| Type | Input data format | Location | Endpoint |
| :----: | :-----------------: | :--------: | :--------: |
| Stand-alone | ATLAS ROOT Ntuple | SSL-River | https://uproot-atlas.servicex.ssl-hep.org/ |
| Stand-alone | ATLAS xAOD | SSL-River | https://xaod.servicex.ssl-hep.org/ |
| Stand-alone | OpenData (ROOT Ntuple) | SSL-River | https://atlasopendata.servicex.ssl-hep.org/ |
| Stand-alone | ATLAS ROOT Ntuple | UC Analysis Facility | https://uproot-atlas.servicex.af.uchicago.edu/ |
| Stand-alone | ATLAS xAOD | UC Analysis Facility | https://xaod.servicex.af.uchicago.edu/ |
| Coffea-casa | CMS ROOT Ntuple | UNL | https://coffea.casa/ |
| Coffea-casa | OpenData (ROOT Ntuple) | UNL | https://coffea-opendata.casa |
| Coffea-casa | OpenData (ROOT Ntuple) | UC Analysis Facility | http://coffea.af.uchicago.edu/ |

<p style="text-align: right;"> *There are other experimental endpoints </p>

<br>
<br>

# ServiceX Client Library

The "base" library to interact with ServiceX

- Makes a request to a ServiceX backend
- Monitors the progress of transformation
- Downloads files as soon as they are ready
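
The "download as soon as ready" behaviour can be illustrated with a small, self-contained sketch. It does not use the ServiceX client at all: `fake_transform` and `download_as_ready` are hypothetical names, and the sleeps stand in for transformation time on the backend. It only shows the general pattern of handling results in completion order rather than submission order.

```python
import asyncio


async def fake_transform(file_name, delay):
    """Stand-in for one file being transformed on the backend (hypothetical)."""
    await asyncio.sleep(delay)
    return f"{file_name}.parquet"


async def download_as_ready(jobs):
    """Collect outputs in the order they finish, not the order they were submitted."""
    tasks = [asyncio.create_task(fake_transform(name, delay)) for name, delay in jobs]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


# asyncio.run needs Python 3.7 or newer.
order = asyncio.run(
    download_as_ready([("a.root", 0.15), ("b.root", 0.05), ("c.root", 0.10)])
)
print(order)
```

The file with the shortest simulated transformation time is handled first, even though it was submitted second.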


### Prerequisites

- Python 3.6 or higher
- A `ServiceX` endpoint


### Configuration file

- Contains endpoint access information
- A config file can contain multiple endpoints
- Other settings (such as `cache_path`) can optionally be included
- The config file can be called `servicex.yaml`, `servicex.yml`, or `.servicex`. The files are searched in that order, and all present are used.
- The search covers the current working directory, its parent directories, and the home directory (`$HOME` on Linux and Mac, the profile directory on Windows)

```
api_endpoints:
  - name: <your-endpoint-name>
    endpoint: <your-endpoint>
    token: <api-token>
    type: uproot
```
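
The search rules above can be sketched in a few lines of Python. This is only an illustration, not the client library's actual implementation: the function name `find_config_files` is hypothetical, and the ordering across directories (working directory first, then parents, then home) is an assumption.

```python
from pathlib import Path

# File names in the order the text says they are searched.
CONFIG_NAMES = ("servicex.yaml", "servicex.yml", ".servicex")


def find_config_files(start, home):
    """Return every config file found in `start`, its parents, and `home`.

    Hypothetical helper: the directory ordering is an assumption.
    """
    start = Path(start).resolve()
    home = Path(home).resolve()
    directories = [start, *start.parents]
    if home not in directories:
        directories.append(home)
    found = []
    for directory in directories:
        for name in CONFIG_NAMES:
            candidate = directory / name
            if candidate.is_file():
                found.append(candidate)
    return found


# Example: list the config files visible from the current directory.
for path in find_config_files(Path.cwd(), Path.home()):
    print(path)
```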


### Local data cache

- ServiceX requests and returned data are stored in a local temporary directory by default
- The same query on the same dataset won't create a new request to the ServiceX endpoint
- The cache is unbounded: it keeps growing until you clean it up
- By default the cache lives in your system's temporary directory, but you can (and may need to) change the path in the configuration file

```
cache_path: <cache-path>
api_endpoints:
  - name: <your-endpoint-name>
    endpoint: <your-endpoint>
    token: <api-token>
    type: uproot
```
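
The idea that the same query on the same dataset does not trigger a new request can be pictured with a toy cache. Everything here is an assumption for illustration: `ToyCache` and `request_key` are hypothetical names, the cache is an in-memory dictionary rather than a directory on disk, and the choice of SHA-256 as the key is mine, not the client's.

```python
import hashlib
import json


def request_key(dataset, query):
    """Build a stable key from the dataset name and the query text."""
    payload = json.dumps({"dataset": dataset, "query": query}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ToyCache:
    """In-memory stand-in for the local data cache (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.backend_calls = 0

    def get(self, dataset, query, fetch):
        key = request_key(dataset, query)
        if key not in self._store:
            # Only a cache miss reaches the (simulated) backend.
            self.backend_calls += 1
            self._store[key] = fetch(dataset, query)
        return self._store[key]


cache = ToyCache()
fetch = lambda dataset, query: f"result of {query} on {dataset}"
first = cache.get("my_dataset", "select lep_pt", fetch)
second = cache.get("my_dataset", "select lep_pt", fetch)
print(cache.backend_calls)
```

Because nothing is ever evicted from `_store`, the toy also shows why an unbounded cache keeps growing.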

<br>
<br>

# ServiceX libraries for USER

- The ServiceX client library accepts queries in the `qastle` format
- `qastle` is not intended to be written by hand
- Libraries that generate `qastle` queries
  - `func-adl-servicex`
  - `tcut-to-qastle`


### Installation

The umbrella package on [PyPI](https://pypi.org/project/servicex-clients/) includes both libraries

`pip install servicex-clients`

*pre-installed in coffea-casa opendata environment

### With func-adl expression

- `func-adl-servicex` uses func-adl expressions to generate `qastle` queries
- Provides objects (`ServiceXSourceUpROOT`, `ServiceXSourceXAOD`, `ServiceXSourceCMSRun1AOD`) that can be used as the root of a func-adl query

### Hands-on ATLAS OpenData

Import libraries relevant for the ServiceX backend.

In [4]:
from servicex import ServiceXDataset
from func_adl_servicex import ServiceXSourceUpROOT

Define input dataset - ATLAS opendata available via XRootD protocol.

In [5]:
dataset_opendata = "root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root"

Specify transformer docker image.

In [6]:
uproot_transformer_image = "sslhep/servicex_func_adl_uproot_transformer:develop"

Now define ServiceXDataset with the input data, transformer image, and the backend name defined in the configuration file.

In [4]:
sx_dataset = ServiceXDataset(dataset_opendata, image=uproot_transformer_image, backend_name='uproot')

Let's define the Uproot dataset object with tree name in the file (`mini`).

In [5]:
ds = ServiceXSourceUpROOT(sx_dataset, "mini")
# ds.return_qastle = True

Now it's time to write the func-adl query.

In [6]:
ds_query = ds \
    .Select("lambda event: {'lep_pt': event.lep_pt, 'lep_eta': event.lep_eta}")

Let's specify the output data format as Pandas DataFrame.

In [7]:
data = ds_query.AsPandasDF()

Nothing has happened behind the scenes yet.

It's finally time to make your request!

In [8]:
out = data.value()

In [9]:
out

Unnamed: 0,lep_pt,lep_eta
0,"[51905.457, 41248.57, 16397.67, 7471.2275]","[-0.9257092, -0.8236952, -0.48641676, 0.26671788]"
1,"[41430.645, 40307.168, 16133.789, 7481.8574]","[-1.2331822, -0.396434, -0.5415077, -0.3021792]"
2,"[33646.71, 27313.271, 20035.95, 16472.64]","[-0.03232379, -0.044152576, 0.06701253, 1.8595..."
3,"[77118.56, 27845.74, 17726.541, 14714.521]","[0.51476425, 0.8453112, 2.1891582, 0.17971124]"
4,"[161909.22, 53367.754, 25596.69, 18864.479]","[-1.0373538, -0.8217277, -1.2618828, 0.12619522]"
...,...,...
164711,"[32143.482, 24158.068, 17203.547, 14358.152]","[-1.0038322, 0.60944843, 0.87633955, 1.0397455]"
164712,"[39488.273, 33694.094, 32709.998, 14797.52]","[0.1847904, 0.7994406, -0.4549886, -1.1673094]"
164713,"[63284.21, 22707.84, 15635.994, 14873.25]","[0.93559146, 0.18448293, 0.17450815, 2.1288664]"
164714,"[52538.805, 40321.457, 25766.85, 19381.92]","[0.8802495, 1.2056149, 1.7011378, 0.85303867]"


Now you have your requested columns!

<br>

### With TCut expression

- `tcut-to-qastle` translates TCut syntax into a `qastle` query that is accepted by the ServiceX client library
- Supported expressions
  - Arithmetic operators: `+, -, *, /`
  - Logical operators: `!, &&, ||`
  - Relational and comparison operators: `==, !=, >, <, >=, <=`
  - Ternary operator: `(A?B:C)` - has to be enclosed in parentheses
  - Mathematical function: `sqrt`
- Usage
  - `qastle_query = tcut_to_qastle.translate(<tree_name>, <selected_columns>, <tcut_selection>)`
- Limitation
    - Only works with the ROOT Ntuple input data format
    - `<tcut_selection>` only works with scalar type branches (crashes if a datatype of TTree branch is `std::vector`)

### Hands-on ATLAS OpenData

We already imported the ServiceX client library

`from servicex import ServiceXDataset`

Let's import tcut-to-qastle library

In [7]:
from tcut_to_qastle import translate

ServiceX Dataset is also defined above:

`sx_dataset = ServiceXDataset(dataset_opendata, image=uproot_transformer_image, backend_name='uproot')`

Let's generate qastle query using tcut-to-qastle.

The first argument is the tree name and the second is the list of columns. The third is the selection to apply; because the branches here are of type `std::vector`, which the selection does not support, we leave it empty.

In [8]:
query = translate("mini", "lep_pt, lep_eta")

Let's have a look at what the qastle query looks like.

In [12]:
query

"(Select (call EventDataset 'ServiceXDatasetSource' 'mini') (lambda (list event) (dict (list 'lep_pt' 'lep_eta') (list (attr event 'lep_pt') (attr event 'lep_eta')))))"

Now let's make a request using the function provided by ServiceX client library.

In [13]:
r = sx_dataset.get_data_pandas_df(query)

In [14]:
r

Unnamed: 0,lep_pt,lep_eta
0,"[51905.457, 41248.57, 16397.67, 7471.2275]","[-0.9257092, -0.8236952, -0.48641676, 0.26671788]"
1,"[41430.645, 40307.168, 16133.789, 7481.8574]","[-1.2331822, -0.396434, -0.5415077, -0.3021792]"
2,"[33646.71, 27313.271, 20035.95, 16472.64]","[-0.03232379, -0.044152576, 0.06701253, 1.8595..."
3,"[77118.56, 27845.74, 17726.541, 14714.521]","[0.51476425, 0.8453112, 2.1891582, 0.17971124]"
4,"[161909.22, 53367.754, 25596.69, 18864.479]","[-1.0373538, -0.8217277, -1.2618828, 0.12619522]"
...,...,...
164711,"[32143.482, 24158.068, 17203.547, 14358.152]","[-1.0038322, 0.60944843, 0.87633955, 1.0397455]"
164712,"[39488.273, 33694.094, 32709.998, 14797.52]","[0.1847904, 0.7994406, -0.4549886, -1.1673094]"
164713,"[63284.21, 22707.84, 15635.994, 14873.25]","[0.93559146, 0.18448293, 0.17450815, 2.1288664]"
164714,"[52538.805, 40321.457, 25766.85, 19381.92]","[0.8802495, 1.2056149, 1.7011378, 0.85303867]"


The output comes directly from the local cache!

You can force the client to ignore the local cache if needed.

In [19]:
sx_dataset_woCache = ServiceXDataset(dataset_opendata, image=uproot_transformer_image, \
                                     backend_name='uproot', ignore_cache=True)

Let's deliver the data as an Apache Parquet file.

In [20]:
out_parquet = sx_dataset_woCache.get_data_parquet(query)
print(out_parquet)

[PosixPath('servicex-cache/data/5f5c30ac-813d-494d-a7ba-ce4cf4b609a8/root___eospublic.cern.ch__eos_opendata_atlas_OutreachDatasets_2020-01-22_4lep_MC_mc_345060.ggH125_ZZ4lep.4lep.root.parquet')]


Let's check the output.

In [24]:
import awkward as ak
ak_arr = ak.from_parquet(out_parquet)
print(f"lep_pt: {ak_arr.lep_pt[0]}")
print(f"lep_eta: {ak_arr.lep_eta[0]}")

lep_pt: [5.19e+04, 4.12e+04, 1.64e+04, 7.47e+03]
lep_eta: [-0.926, -0.824, -0.486, 0.267]


### Other ServiceX endpoint for ATLAS OpenData 



We've been using the ServiceX OpenData endpoint deployed inside the coffea-casa analysis facility.

In [2]:
!cat ../.servicex.yaml

cat: .servicex.yml: No such file or directory


Let's try another ServiceX endpoint, this time a stand-alone one.

Here we can only try endpoints for OpenData, but you can also try the dedicated endpoints of the experiment you belong to.

Let's create a ServiceX client configuration file for the endpoint at SSL-River.

In [9]:
%%writefile servicex.yaml
api_endpoints:
  - name: opendata_river
    endpoint: https://atlasopendata.servicex.ssl-hep.org/
    type: uproot

Overwriting servicex.yaml


Let's create ServiceX Dataset with backend_name `opendata_river`

In [10]:
sx_dataset_river = ServiceXDataset(dataset_opendata, image=uproot_transformer_image, \
                                   backend_name='opendata_river')

Now the request will be sent to the endpoint at SSL-River.

For stand-alone endpoints, you can monitor the status of transformations in the dashboard:
[SSL-River OpenData Dashboard](https://atlasopendata.servicex.ssl-hep.org/global-dashboard)

In [16]:
out_from_river = sx_dataset_river.get_data_pandas_df(query)

In [15]:
out_from_river

Unnamed: 0,lep_pt,lep_eta
0,"[51905.457, 41248.57, 16397.67, 7471.2275]","[-0.9257092, -0.8236952, -0.48641676, 0.26671788]"
1,"[41430.645, 40307.168, 16133.789, 7481.8574]","[-1.2331822, -0.396434, -0.5415077, -0.3021792]"
2,"[33646.71, 27313.271, 20035.95, 16472.64]","[-0.03232379, -0.044152576, 0.06701253, 1.8595..."
3,"[77118.56, 27845.74, 17726.541, 14714.521]","[0.51476425, 0.8453112, 2.1891582, 0.17971124]"
4,"[161909.22, 53367.754, 25596.69, 18864.479]","[-1.0373538, -0.8217277, -1.2618828, 0.12619522]"
...,...,...
164711,"[32143.482, 24158.068, 17203.547, 14358.152]","[-1.0038322, 0.60944843, 0.87633955, 1.0397455]"
164712,"[39488.273, 33694.094, 32709.998, 14797.52]","[0.1847904, 0.7994406, -0.4549886, -1.1673094]"
164713,"[63284.21, 22707.84, 15635.994, 14873.25]","[0.93559146, 0.18448293, 0.17450815, 2.1288664]"
164714,"[52538.805, 40321.457, 25766.85, 19381.92]","[0.8802495, 1.2056149, 1.7011378, 0.85303867]"


<br>
<br>

# ServiceX DataBinder

- The ServiceX client libraries are all you need to interact with a ServiceX backend and handle outputs.
- DataBinder is a Python package that makes your life easier when you need to process a large number of datasets (e.g. a full Run-2 physics analysis).
- It handles all your ServiceX requests and delivered data from a single configuration file.

### Configuration file

```
General:
  ServiceXBackendName: uproot
  OutputDirectory: /path/to/output
  OutputFormat: parquet
  
Sample:
  - Name: ttH
    RucioDID: user.kchoi:user.kchoi.sampleA, 
             user.kchoi:user.kchoi.sampleB
    Tree: nominal
    FuncADL: "Select(lambda event: {'jet_e': event.jet_e, 'jet_pt': event.jet_pt})"
  - Name: ttW
    RucioDID: user.kchoi:user.kchoi.sampleC
    Tree: nominal
    Filter: n_jet > 5 
    Columns: jet_e, jet_pt
```

- The configuration file is in YAML format

| Option for `General` | Description       | DataType |
|:--------:|:------:|:------|
| `ServiceXBackendName` | ServiceX backend name in your `servicex.yaml` file <br> (name should contain either `uproot` or `xAOD` to distinguish the type of transformer) | `String` |
| `OutputDirectory` | Path to the directory for ServiceX delivered files | `String` |
| `OutputFormat` | Output file format of ServiceX delivered data (`parquet` or `root` for `uproot` / `root` for `xaod`) | `String` |
| `ZipROOTColumns` | Zip columns that share prefix to generate one counter branch (see detail at [uproot readthedoc](https://uproot.readthedocs.io/en/latest/basic.html#writing-ttrees-to-a-file)) | `Boolean` |
| `WriteOutputDict` | Name of an output yaml file containing a nested Python dictionary of output file paths (located in the `OutputDirectory`) | `String` |
| `IgnoreServiceXCache` | Ignore the existing ServiceX cache and force to make ServiceX requests | `Boolean` |

| Option for `Sample` | Description       |DataType |
|:--------:|:------:|:------|
| `Name`   | sample name defined by a user |`String` |
| `RucioDID` | Rucio Dataset Id (DID) for a given sample; <br> Can be multiple DIDs separated by comma |`String` |
| `XRootDFiles` | XRootD files (e.g. `root://`) for a given sample; <br> Can be multiple files separated by comma |`String` |
| `Tree` | Name of the input ROOT `TTree` (`uproot` ONLY) |`String` |
| `Filter` | Selection in the TCut syntax, e.g. `jet_pt > 10e3 && jet_eta < 2.0` (TCut ONLY) |`String` |
| `Columns` | List of columns (or branches) to be delivered; multiple columns separated by commas (TCut ONLY) |`String` |
| `FuncADL` | func-adl expression for a given sample (func adl ONLY) |`String` |
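
As the table notes, `RucioDID` and `XRootDFiles` can each hold several entries separated by commas. The sketch below shows one way such a value could be split into a clean list once the YAML has been loaded into a dictionary. It is an illustration only: `split_comma_list` and `sample_inputs` are hypothetical helpers, not DataBinder functions, and checking `RucioDID` before `XRootDFiles` is an arbitrary choice.

```python
def split_comma_list(value):
    """Split a comma-separated option value into a list without blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


def sample_inputs(sample):
    """Return the input kind and the list of inputs for one `Sample` entry.

    Hypothetical helper; the lookup order is an assumption.
    """
    if "RucioDID" in sample:
        return "rucio", split_comma_list(sample["RucioDID"])
    if "XRootDFiles" in sample:
        return "xrootd", split_comma_list(sample["XRootDFiles"])
    raise ValueError(f"Sample {sample.get('Name')!r} has no input defined")


# A sample as it might look after loading the configuration file above.
tth = {
    "Name": "ttH",
    "RucioDID": "user.kchoi:user.kchoi.sampleA, user.kchoi:user.kchoi.sampleB",
    "Tree": "nominal",
}
kind, inputs = sample_inputs(tth)
print(kind, inputs)
```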




### Installation

In [18]:
# !pip install servicex-databinder

### Deliver data

```
from servicex_databinder import DataBinder
sx_db = DataBinder('<CONFIG>.yml')
out = sx_db.deliver()
```

### Hands-on ATLAS OpenData

Let's create a configuration file 
- ServiceX requests to SSL-River ServiceX endpoint
- Outputs in ROOT Ntuple
- Zip `std::vector` branches if prefix is the same
- Write a yaml file containing the paths of output files
- Two datasets
- Deliver two branches without selection

In [None]:
%%writefile config_databinder_opendata.yaml
General:
  ServiceXBackendName: opendata_river
  OutputDirectory: ServiceXData_atlasopendata
  OutputFormat: root
  ZipROOTColumns: True
  WriteOutputDict: out_atlasopendata  

Sample:
  - Name: data
    XRootDFiles: root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/Data/data_A.4lep.root,
            root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/Data/data_B.4lep.root,
            root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/Data/data_C.4lep.root,
            root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/Data/data_D.4lep.root
    Tree: mini
    FuncADL: "Select(lambda event: {'lep_pt': event.lep_pt, 'lep_eta': event.lep_eta})"
  - Name: ggH125_ZZ4lep
    XRootDFiles: root://eospublic.cern.ch//eos/opendata/atlas/OutreachDatasets/2020-01-22/4lep/MC/mc_345060.ggH125_ZZ4lep.4lep.root
    Tree: mini
    FuncADL: "Select(lambda event: {'lep_pt': event.lep_pt, 'lep_eta': event.lep_eta})"

Now import the package and load the configuration file you wrote.

In [None]:
from servicex_databinder import DataBinder
sx_db = DataBinder('config_databinder_opendata.yaml')

Time to make ServiceX delivery requests!

In [None]:
out = sx_db.deliver()

In [None]:
out

<br>
<br>

# Outlook

- ServiceX provides a whole new experience in accessing data
- We talked about ServiceX itself, but it can be (and already is being) integrated into other frameworks such as coffea
- ServiceX is still improving! Stay tuned!

# Useful links

- [Readthedoc - ServiceX](https://servicex.readthedocs.io/en/latest/)
- [Github - ServiceX Frontend](https://github.com/ssl-hep/ServiceX_frontend)
- [Github - func-adl-servicex](https://github.com/iris-hep/func_adl_servicex)
- [Github - tcut-to-qastle](https://github.com/ssl-hep/TCutToQastleWrapper)