# UMAMI Demo

This notebook demonstrates how to generate an UMAMI plot from a CSV file produced by the `summarize_job.py` script included with pytokio.

In [None]:
import os
import json
import pandas
import datetime
import tokio.tools.umami

We _must_ pass `parse_dates` for any column that will supply an `UmamiMetric`'s timestamps, so that pandas loads it as datetimes rather than as strings.

In [None]:
df = pandas.read_csv('sample_summary.csv',
                     parse_dates=['_datetime_start', '_datetime_end'])
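
To see what `parse_dates` changes, the following minimal sketch loads a two-row CSV from an in-memory string instead of `sample_summary.csv`. The rows and the `job_id` column are invented for illustration; only the `_datetime_start` and `_datetime_end` column names come from this notebook.

```python
import io
import pandas

# Hypothetical two-row stand-in for sample_summary.csv
csv_text = """job_id,_datetime_start,_datetime_end
1,2017-02-21 08:00:00,2017-02-21 09:30:00
2,2017-03-01 12:00:00,2017-03-01 12:45:00
"""

# Without parse_dates, the timestamp columns are loaded as plain strings
raw = pandas.read_csv(io.StringIO(csv_text))

# With parse_dates, the same columns are loaded as datetime columns
parsed = pandas.read_csv(io.StringIO(csv_text),
                         parse_dates=['_datetime_start', '_datetime_end'])

print(raw['_datetime_start'].dtype)
print(parsed['_datetime_start'].dtype)

# Datetime columns support arithmetic, such as computing each job's duration
print(parsed['_datetime_end'] - parsed['_datetime_start'])
```

Only the parsed version can be compared against `datetime.datetime` objects or subtracted to obtain durations, which is why the timestamp columns need this treatment.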

## Filter data

We can select only the jobs that match certain criteria by applying comparison operators to DataFrame columns; each comparison yields a boolean filter.  We can then combine these filters using the `&` operator.

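In this example, we apply the following filters:

1. Select only jobs that did more writes than reads.  The `darshan_biggest_{read,write}_api_bytes` keys contain the number of bytes read or written through whichever single I/O API (POSIX or MPI-IO) moved the most data.

2. Select only jobs that started within a specific time range.

3. Select only jobs whose largest share of written data went to a file system mounted under `/scratch2`.

4. Select only jobs for which Darshan reports no more data than Lustre (LMT) recorded as written.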

In [None]:
### Select only jobs that did more writes than reads
filtered_indices = df['darshan_biggest_write_api_bytes'] > df['darshan_biggest_read_api_bytes']

### Select jobs in a specific time range
filtered_indices &= df['_datetime_start'] > datetime.datetime(2017, 2, 21)
filtered_indices &= df['_datetime_start'] < datetime.datetime(2017, 3, 3, 12, 0, 0)

### and also select only jobs whose I/O went to a file system mounted at a specific place.
### Pandas loads empty CSV values as NaN; na=False makes those rows evaluate to False
### rather than NaN, so the result can be combined with the other boolean filters.
filtered_indices &= df['darshan_biggest_write_fs'].str.startswith('/scratch2', na=False)


### Filter out jobs where Darshan says we wrote more than Lustre thinks
filtered_indices &= df['darshan_total_gibs_posix'] <= df['lmt_tot_gibs_written']
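
The mask-combining pattern used above can be tried on a small, made-up DataFrame. The sketch below borrows four column names from this notebook but invents every value. It shows that each comparison yields a boolean Series, that `&=` progressively narrows the selection, and that a missing file system name can be treated as a non-match, here using pandas's `.str.startswith(..., na=False)`.

```python
import datetime
import numpy
import pandas

# Made-up values; only the column names mirror the summary CSV
toy = pandas.DataFrame({
    '_datetime_start': pandas.to_datetime(
        ['2017-02-20', '2017-02-25', '2017-02-27', '2017-03-05']),
    'darshan_biggest_write_api_bytes': [10, 50, 5, 80],
    'darshan_biggest_read_api_bytes': [20, 10, 1, 10],
    'darshan_biggest_write_fs': ['/scratch2', '/scratch2', numpy.nan, '/scratch1'],
})

# Each comparison produces a boolean Series aligned with the DataFrame index.
# Rows 1, 2, and 3 wrote more than they read.
mask = toy['darshan_biggest_write_api_bytes'] > toy['darshan_biggest_read_api_bytes']

# Rows 1, 2, and 3 also started after 21 February 2017.
mask &= toy['_datetime_start'] > datetime.datetime(2017, 2, 21)

# Row 2 has no file system name (NaN) and row 3 is on /scratch1,
# so only row 1 survives all three filters.
mask &= toy['darshan_biggest_write_fs'].str.startswith('/scratch2', na=False)

print(toy[mask])
```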

We can now view a few example rows from our filtered data.  Note that we use `.T` to transpose the example rows just so you can see all of the metrics contained in this DataFrame.

In [None]:
df[filtered_indices].head().T

## Build the Umami object

We want to look at the following metrics:

1. `darshan_agg_perf_by_slowest_posix`, converted to GiB/s
2. `darshan_total_gibs_posix` / ( `lmt_tot_gibs_read` + `lmt_tot_gibs_written` )
3. `fshealth_ost_most_full_pct`
4. `lmt_max_oss_cpu`
5. `lmt_ave_mds_cpu`

For each one, we create an `UmamiMetric` object, then we build the `Umami` object from them.  `UmamiMetric` objects accept anything list-like (i.e., anything that can be sliced in exactly one dimension) or `pandas.Series` objects.  In this example, we pass `pandas.Series` objects pulled straight out of our `df` DataFrame.

Note that the following code is expanded out to make it easy to read; in practice, it's more concise to iteratively create and add new `UmamiMetric`s to the parent `Umami` object.

In [None]:
metric1 = tokio.tools.umami.UmamiMetric(
    timestamps=df[filtered_indices]['_datetime_start'],
    values=df[filtered_indices]['darshan_agg_perf_by_slowest_posix'] / 1024.0,
    label="Performance (GiB/sec)",
    big_is_good=True)

metric2 = tokio.tools.umami.UmamiMetric(
    timestamps=df[filtered_indices]['_datetime_start'],
    values=df[filtered_indices]['darshan_total_gibs_posix'] /
        (df[filtered_indices]['lmt_tot_gibs_read'] + df[filtered_indices]['lmt_tot_gibs_written']),
    label="Coverage Factor",
    big_is_good=True)

metric3 = tokio.tools.umami.UmamiMetric(
    timestamps=df[filtered_indices]['_datetime_start'],
    values=df[filtered_indices]['fshealth_ost_most_full_pct'],
    label="OST Fullness",
    big_is_good=False)

metric4 = tokio.tools.umami.UmamiMetric(
    timestamps=df[filtered_indices]['_datetime_start'],
    values=df[filtered_indices]['lmt_max_oss_cpu'],
    label="Highest OSS CPU Load",
    big_is_good=False)

metric5 = tokio.tools.umami.UmamiMetric(
    timestamps=df[filtered_indices]['_datetime_start'],
    values=df[filtered_indices]['lmt_ave_mds_cpu'],
    label="Average MDS CPU Load",
    big_is_good=False)
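
As a worked example of the two derived metrics above, the sketch below applies the same arithmetic to invented numbers. The division by 1024 assumes the input rate is in MiB/s, which is an inference from the GiB/s label rather than something the CSV states; none of the values are taken from `sample_summary.csv`.

```python
import pandas

# Invented values for two hypothetical jobs
toy = pandas.DataFrame({
    'darshan_agg_perf_by_slowest_posix': [2048.0, 512.0],  # assumed to be MiB/s
    'darshan_total_gibs_posix': [50.0, 10.0],
    'lmt_tot_gibs_read': [20.0, 30.0],
    'lmt_tot_gibs_written': [80.0, 10.0],
})

# Performance: divide by 1024 to convert the assumed MiB/s into GiB/s
perf_gibs = toy['darshan_agg_perf_by_slowest_posix'] / 1024.0

# Coverage factor: the job's Darshan-reported GiB divided by the
# total GiB (read plus written) reported by the LMT columns
coverage = toy['darshan_total_gibs_posix'] / (
    toy['lmt_tot_gibs_read'] + toy['lmt_tot_gibs_written'])

print(perf_gibs.tolist())   # [2.0, 0.5]
print(coverage.tolist())    # [0.5, 0.25]
```

In the first row, the job's 50 GiB is half of the 100 GiB that the LMT columns report, giving a coverage factor of 0.5.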

Construct an `Umami` object with the five metrics we defined above.

In [None]:
umami = tokio.tools.umami.Umami()
umami['performance'] = metric1
umami['cf'] = metric2
umami['fullness'] = metric3
umami['max_oss_cpu'] = metric4
umami['avg_mds_cpu'] = metric5
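
The more concise, iterative style mentioned earlier might look like the following sketch. `METRIC_SPECS` and `build_metric_kwargs` are hypothetical names, not part of pytokio, and the single-row `demo` frame is invented so that the block runs without the sample CSV. The block only assembles the keyword arguments; the final step that needs pytokio is shown in comments and reuses the `UmamiMetric` arguments and `umami[key] = ...` assignment seen above.

```python
import pandas

# One entry per metric: (key, label, big_is_good, function computing the values)
METRIC_SPECS = [
    ('performance', "Performance (GiB/sec)", True,
     lambda d: d['darshan_agg_perf_by_slowest_posix'] / 1024.0),
    ('cf', "Coverage Factor", True,
     lambda d: d['darshan_total_gibs_posix'] /
               (d['lmt_tot_gibs_read'] + d['lmt_tot_gibs_written'])),
    ('fullness', "OST Fullness", False,
     lambda d: d['fshealth_ost_most_full_pct']),
    ('max_oss_cpu', "Highest OSS CPU Load", False,
     lambda d: d['lmt_max_oss_cpu']),
    ('avg_mds_cpu', "Average MDS CPU Load", False,
     lambda d: d['lmt_ave_mds_cpu']),
]

def build_metric_kwargs(selected):
    """Hypothetical helper: map each metric key to its UmamiMetric keyword arguments."""
    kwargs_by_key = {}
    for key, label, big_is_good, compute in METRIC_SPECS:
        kwargs_by_key[key] = dict(
            timestamps=selected['_datetime_start'],
            values=compute(selected),
            label=label,
            big_is_good=big_is_good)
    return kwargs_by_key

# Invented single-row frame standing in for df[filtered_indices]
demo = pandas.DataFrame({
    '_datetime_start': pandas.to_datetime(['2017-02-25']),
    'darshan_agg_perf_by_slowest_posix': [1024.0],
    'darshan_total_gibs_posix': [25.0],
    'lmt_tot_gibs_read': [25.0],
    'lmt_tot_gibs_written': [75.0],
    'fshealth_ost_most_full_pct': [60.0],
    'lmt_max_oss_cpu': [35.0],
    'lmt_ave_mds_cpu': [12.0],
})
metric_kwargs = build_metric_kwargs(demo)
print(list(metric_kwargs))

# With pytokio importable, the same loop would populate the Umami object:
#   umami = tokio.tools.umami.Umami()
#   for key, kwargs in build_metric_kwargs(df[filtered_indices]).items():
#       umami[key] = tokio.tools.umami.UmamiMetric(**kwargs)
```

Keeping the labels and `big_is_good` flags in one table makes it easier to add or reorder metrics without repeating the `timestamps=` boilerplate for each one.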


## Examine the UMAMI

The most obvious thing to do with an UMAMI is plot it.

In [None]:
fig = umami.plot()
fig

Its data can also be presented as a DataFrame.  Note that the `UmamiMetric` metadata (the label and whether bigger values are better or not) are lost when we do this.

In [None]:
umami.to_dataframe()

`Umami` and `UmamiMetric` objects also serialize reasonably well.  The parent object can be converted either to a dictionary or to a JSON string.

In [None]:
print(umami.to_dict())

In [None]:
print(umami.to_json())