# DQM Playground Data Engineering Test Client

A sample API client for accessing DQM Playground's data. 

## Install the auto-generated API client
Since the API is described by an OpenAPI specification, we can automatically generate a Python client using:
- The API schema generated by DQM Playground [here](https://ml4dqm-playground.web.cern.ch/openapi?format=openapi-json)
- The Swagger client generator [here](https://editor.swagger.io/)

We download the generated client and install it like so:

In [None]:
!pip install -e ./client -U --user
!pip install -r requirements.txt

In [None]:
import swagger_client
from swagger_client.rest import ApiException
from pprint import pprint
import requests

## Configure the client
Replace the value of `API_TOKEN` below with your own API token.

In [None]:
# Configure and create an API client
# using an API token
API_TOKEN = None
MLP_URL = "https://ml4dqm-playground.web.cern.ch"
MLP_DEVELOP_URL = "https://ml4dqm-playground-develop.web.cern.ch"

configuration = swagger_client.Configuration()
configuration.host = MLP_URL
client = swagger_client.ApiClient(configuration)
client.set_default_header(header_name="Content-Type", header_value="application/json")
client.set_default_header(header_name="Authorization", header_value=f"Token {API_TOKEN}")

api_instance = swagger_client.ApiApi(client)
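
To avoid hard-coding the token in the notebook, it could be read from an environment variable instead. The following is a minimal sketch; the variable name `DQM_PLAYGROUND_API_TOKEN` and the helper `build_auth_headers` are assumptions made for illustration and are not part of DQM Playground or the generated client.

```python
import os


def build_auth_headers(env_var="DQM_PLAYGROUND_API_TOKEN"):
    """Build the request headers from a token stored in an environment variable.

    The variable name is hypothetical; use whichever name suits your setup.
    """
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "Content-Type": "application/json",
        "Authorization": f"Token {token}",
    }
```

The returned dictionary can be passed as `headers` to `requests`, or its entries can be given one by one to `client.set_default_header()` as in the cell above.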

## Fetch data

<div class="alert alert-block alert-info">
    <b>Note:</b> The fetched data is paginated by 50, meaning that each response will
    have at most 50 results. You can ask for more by specifying a page number.
</div>
<div class="alert alert-block alert-warning">
    <b>Warning:</b> The response should normally contain <b>next</b>, <b>previous</b> and <b>count</b> keys,
    showing the total number of results available. This does not currently work with
    this automatically generated client.
</div>
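
Since the `next` and `count` keys are not available through the generated client, one way to collect more than one page is to keep requesting pages until a page comes back with fewer results than the page size. The sketch below assumes this stopping rule; `fetch_all_pages` and `fake_list` are hypothetical names, and `fake_list` only stands in for a real client method such as `api_instance.list_runs`. If the total number of results is an exact multiple of the page size, one extra request is made, and how the API answers an out-of-range page is not handled here.

```python
from types import SimpleNamespace


def fetch_all_pages(list_fn, page_size=50, max_pages=100, **filters):
    """Collect results page by page until a page comes back short.

    list_fn is any callable accepting `page` plus filter keyword arguments
    and returning an object with a `results` list.
    """
    collected = []
    for page in range(1, max_pages + 1):
        response = list_fn(page=page, **filters)
        batch = list(response.results)
        collected.extend(batch)
        if len(batch) < page_size:  # a short page means it was the last one
            break
    return collected


# Stand-in for a client method, serving 120 fake items in pages of 50
def fake_list(page=1, **filters):
    items = list(range(120))
    start = (page - 1) * 50
    return SimpleNamespace(results=items[start:start + 50])


all_items = fetch_all_pages(fake_list)
print(f"Collected {len(all_items)} items")
```

With the real client, the call would look like `fetch_all_pages(api_instance.list_lumisections, run=315257)`. The `max_pages` cap is there to avoid an unbounded loop.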

### Run information

Available parameters for filtering Runs are:


- `page = 56` # int | A page number within the paginated result set. (optional)
- `run_number = 'run_number_example'` # str | run_number (optional)
- `run_date = 'run_date_example'` # str | run_date (optional)
- `year = 'year_example'` # str | year (optional)
- `period = 'period_example'` # str | period (optional)
- `_date = '_date_example'` # str | date (optional)
- `oms_fill = 'oms_fill_example'` # str | oms_fill (optional)
- `oms_lumisections = 'oms_lumisections_example'` # str | oms_lumisections (optional)
- `oms_initial_lumi = 'oms_initial_lumi_example'` # str | oms_initial_lumi (optional)
- `oms_end_lumi = 'oms_end_lumi_example'` # str | oms_end_lumi (optional)

In [None]:
run_number = 315741  # an int will also do
run = api_instance.list_runs(run_number=run_number)
pprint(run)

### Lumisections information

<div class="alert alert-block alert-info">
    <b>Note:</b> Currently not much info on Lumisections is available through the API
</div>

Available parameters for filtering Lumisections are:

- `page = 56` # int | A page number within the paginated result set. (optional)
- `run = 'run_example'` # str | run (optional)
- `ls_number = 'ls_number_example'` # str | ls_number (optional)
- `_date = '_date_example'` # str | date (optional)
- `oms_zerobias_rate = 'oms_zerobias_rate_example'` # str | oms_zerobias_rate (optional)

In [None]:
# Pass the run number as `run` to fetch all lumisections of a specific run
run_number = 315257 
ls = api_instance.list_lumisections(run=run_number)
pprint(ls)

### Lumisection Histogram 1D

Available parameters for filtering 1D Lumisection Histograms are:
- `page = 56` # int | A page number within the paginated result set. (optional)
- `lumisection__run__run_number = 'lumisection__run__run_number_example'` # str | lumisection__run__run_number (optional)
- `lumisection__run__run_number__gte = 'lumisection__run__run_number__gte_example'` # str | lumisection__run__run_number__gte (optional)
- `lumisection__run__run_number__lte = 'lumisection__run__run_number__lte_example'` # str | lumisection__run__run_number__lte (optional)
- `lumisection__ls_number = 'lumisection__ls_number_example'` # str | lumisection__ls_number (optional)
- `lumisection__ls_number__gte = 'lumisection__ls_number__gte_example'` # str | lumisection__ls_number__gte (optional)
- `lumisection__ls_number__lte = 'lumisection__ls_number__lte_example'` # str | lumisection__ls_number__lte (optional)
- `entries__gte = 'entries__gte_example'` # str | entries__gte (optional)
- `entries__lte = 'entries__lte_example'` # str | entries__lte (optional)
- `title = 'title_example'` # str | title (optional)
- `lumisection__ls_number__in = 'lumisection__ls_number__in_example'` # str | lumisection__ls_number__in (optional)
- `lumisection__run__run_number__in = 'lumisection__run__run_number__in_example'` # str | lumisection__run__run_number__in (optional)
- `source_data_file__filepath__contains = 'source_data_file__filepath__contains_example'` # str | source_data_file__filepath__contains (optional)
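
Several of these filters can be combined in a single call. The sketch below builds the keyword arguments for a run range and a set of lumisection numbers. `build_histogram_filters` is a hypothetical helper, and the comma-separated format used for the `__in` filter is an assumption that should be checked against the API.

```python
def build_histogram_filters(run_min=None, run_max=None, ls_numbers=None, title=None):
    """Build keyword arguments for the lumisection histogram list methods."""
    filters = {}
    if run_min is not None:
        filters["lumisection__run__run_number__gte"] = run_min
    if run_max is not None:
        filters["lumisection__run__run_number__lte"] = run_max
    if ls_numbers:
        # Assumption: the `__in` filter takes a comma-separated string
        filters["lumisection__ls_number__in"] = ",".join(str(n) for n in ls_numbers)
    if title is not None:
        filters["title"] = title
    return filters


filters = build_histogram_filters(run_min=315257, run_max=315267, ls_numbers=[1, 2, 3])
print(filters)
```

The resulting dictionary would then be unpacked into the client call, for example `api_instance.list_lumisection_histogram1_ds(**filters)`.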

In [None]:
# Get by title
title = "Summary_ClusterStoNCorr__OnTrack__TEC__MINUS__wheel__7"
lh1d = api_instance.list_lumisection_histogram1_ds(title=title) 
print(f"Got {len(lh1d.results)} results!\n")
pprint(lh1d)

In [None]:
# Get by run number
run_num = 315267
lh1d = api_instance.list_lumisection_histogram1_ds(
    lumisection__run__run_number=run_num
) 
print(f"Got {len(lh1d.results)} results!\n")
pprint(lh1d)

In [None]:
# Delete by Lumisection Histogram id
# The id can be retrieved by any of the methods above
api_instance.destroy_lumisection_histogram1_d(id=41) 

### Lumisection Histogram 2D
The same filters as for 1D Histograms apply.

In [None]:
# By run number
run_num = 297656
lh2d = api_instance.list_lumisection_histogram2_ds(
    lumisection__run__run_number=run_num
) 
print(f"Got {len(lh2d.results)} results!\n")
# Warning -- this contains a LOT of data
#pprint(lh2d)

In [None]:
# By filename
filename = 'ZeroBias_2017B_DataFrame_2D_17.csv'
lh2d = api_instance.list_lumisection_histogram2_ds(
    source_data_file__filepath__contains=filename
)  
print(f"Got {len(lh2d.results)} results!\n")

### Working with the data received
The responses returned by the client so far hold, under `results`, **lists of instances of custom classes**, which reflect the data stored in the database that the API operates on.
These classes have a `to_dict()` method that converts them to dictionaries, which you can then turn into whatever data format you require.
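
As a minimal sketch of that conversion step, the example below serializes a list of results to JSON. `FakeResult` and its fields are stand-ins invented for illustration; the real classes generated by Swagger expose different attributes, but the `to_dict()` call is used in the same way.

```python
import json


class FakeResult:
    """Stand-in for a generated model class; only to_dict() matters here."""

    def __init__(self, id, title, entries):
        self.id = id
        self.title = title
        self.entries = entries

    def to_dict(self):
        return {"id": self.id, "title": self.title, "entries": self.entries}


results = [FakeResult(1, "hist_a", 10), FakeResult(2, "hist_b", 20)]

# Convert every result to a plain dictionary first
records = [result.to_dict() for result in results]

# default=str is a fallback for values json cannot encode natively (e.g. datetimes)
as_json = json.dumps(records, default=str)
print(as_json)
```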

#### Convert to pandas DataFrame
An example for 1D Lumisection histograms, using the `pd.DataFrame.from_dict()` method to load them.

In [None]:
import pandas as pd

lh1d_df = pd.DataFrame.from_dict([result.to_dict() for result in lh1d.results])

print(lh1d_df.head())

# Managing DQM Histogram Data Files

DQM data files are `.csv` files (`.root` support for nanoDQM files will be added in the future) which contain:
- Run Histograms or
- Lumisection 1D Histograms or
- Lumisection 2D Histograms

They are represented in DQM Playground's database with the `HistogramDataFile` model. 

By default, these files are expected to reside in `/eos/project/c/cmsml4dc`. This directory is not being monitored for changes, and new files can only be discovered using a [custom Django Management command](https://cmstrackerdpg.github.io/cms-tkdpg-software-knowledge-transfer/mlplayground/apps/histogram_file_manager/management/). 

The contents of a Histogram Data File are unknown until a user starts parsing it by specifying three options:
- The `granularity` of the data (`run`, or `lum` for lumisection),
- The `data_dimensionality` of the data (`1` for 1D or `2` for 2D) and
- The `file_format` of the file (only `csv` is currently available).

Depending on the combination of the three options above, DQM Playground will try to extract the appropriate histograms from the Histogram Data File, namely:

- `granularity=run`, `data_dimensionality=1` and `file_format=csv` will produce `RunHistogram`s,
- `granularity=lum`, `data_dimensionality=1` and `file_format=csv` will produce `LumisectionHistogram1D`s,
- `granularity=lum`, `data_dimensionality=2` and `file_format=csv` will produce `LumisectionHistogram2D`s.
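
The combinations listed above can be written down as a small lookup table, which is handy for validating the options before sending a parsing request. This is only a client-side sketch; `HISTOGRAM_MODELS` and `expected_histogram_model` are hypothetical names and do not exist in DQM Playground.

```python
# (granularity, data_dimensionality, file_format) -> histogram type produced
HISTOGRAM_MODELS = {
    ("run", 1, "csv"): "RunHistogram",
    ("lum", 1, "csv"): "LumisectionHistogram1D",
    ("lum", 2, "csv"): "LumisectionHistogram2D",
}


def expected_histogram_model(granularity, data_dimensionality, file_format="csv"):
    """Return the histogram type a combination of options should produce."""
    key = (granularity, data_dimensionality, file_format)
    try:
        return HISTOGRAM_MODELS[key]
    except KeyError:
        raise ValueError(f"Unsupported combination: {key}") from None


print(expected_histogram_model("lum", 2))
```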

## Listing all the available Histogram Data Files

To list all the Histogram Data Files that DQM Playground has discovered:

In [None]:
available_hdf = api_instance.list_histogram_data_files(page=1) # No filters supplied
print(f"Got {len(available_hdf.results)} Histogram Data Files")  # Keep in mind that results are returned in pages of 50

## Filtering histogram data files by filename
For this example, we will try to get the status of the `ZeroBias_2017B_DataFrame_2D_17.csv` DQM file.

In [None]:
available_hdf = api_instance.list_histogram_data_files(filepath__contains="ZeroBias_2017B_DataFrame_2D_17")
print(available_hdf.results)

If parsing has already been started for this specific Histogram Data File, we will see the `data_dimensionality` and `granularity` values already set. If not, they will be unknown. 

Once parsing of the file has started, the total number of `run_histograms`, `lumisection_histograms_1d` or `lumisection_histograms_2d` that were extracted from this file will be available under the respective values in the response. 

The file's parsing progress is available under the `percentage_processed` value.
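
To follow the progress of a file, `percentage_processed` can be polled until parsing completes. The sketch below assumes that the value is numeric and that `100` means the file is fully parsed; this should be verified against the actual responses, which is why the threshold is a parameter. `wait_until_parsed` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for real API responses.

```python
import time
from types import SimpleNamespace


def wait_until_parsed(get_file, done_at=100, poll_seconds=30, timeout_seconds=3600):
    """Poll a Histogram Data File until percentage_processed reaches done_at.

    get_file is a callable returning the current state of the file, e.g. a
    lambda wrapping api_instance.list_histogram_data_files(...).results[0].
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        hdf = get_file()
        if hdf.percentage_processed >= done_at:
            return hdf
        if time.monotonic() >= deadline:
            raise TimeoutError("Parsing did not finish within the timeout")
        time.sleep(poll_seconds)


# Stand-in sequence of responses: 0%, then 50%, then 100%
states = iter(SimpleNamespace(percentage_processed=p) for p in (0, 50, 100))
finished = wait_until_parsed(lambda: next(states), poll_seconds=0)
print(f"Processed: {finished.percentage_processed}%")
```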

## Starting parsing a Histogram Data File
To start parsing a Histogram Data File, we need:
- Its unique `id`,
- The `granularity` of the data contained (`run`, or `lum` for lumisection),
- The `data_dimensionality` of the data (`1` for 1D or `2` for 2D) and
- The file `file_format` (`csv`).

In [None]:
available_hdf = api_instance.list_histogram_data_files(filepath__contains="ZeroBias_2017B_DataFrame_2D_17")
file_id = available_hdf.results[0].id

# TODO: Buggy generated code by swagger creates multiple definitions for start_parsing_histogram_data_file
# body = {'granularity':'lum', 'data_dimensionality':2, 'file_format':'csv'}
# success = api_instance.start_parsing_histogram_data_file(id=file_id, body=body)

# Manually do the request for now
r = requests.post(
    f"{MLP_URL}/api/histogram_data_files/{file_id}/start_parsing/",
    headers={"Content-Type": "application/json", "Authorization": f"Token {API_TOKEN}"},
    json={'granularity':'lum', 'data_dimensionality':2, 'file_format':'csv'}
)

In [None]:
print(f"Parsing for file {file_id} started: {r.ok}")

## Discovering new DQMIO files [Admins only]
If new files are added to the directories where DQMIO files reside, DQM Playground will not discover them automatically; a new discovery process has to be triggered. To do so:

In [None]:
# Manually do the request for now
r = requests.get(
    f"{MLP_DEVELOP_URL}/api/histogram_data_files/discover/",
    headers={"Content-Type": "application/json", "Authorization": f"Token {API_TOKEN}"},
)

In [None]:
print(f"Discovery started: {r.ok}")