# Azure Data University: mlos_bench SQLite data analysis (Student's workbook)

In this notebook, we look at the data from 100 trials we ran in `mlos_bench` to find a better SQLite configuration.

### 1. Data collection

We used the following commands in the integrated terminal of this codespace:

```sh
conda activate mlos

mlos_bench --config config/cli/local-sqlite-opt.jsonc \
           --globals config/experiments/sqlite-sync-journal-pagesize-caching-experiment.jsonc \
           --max-iterations 100
```

> See Also: [README.md](./README.md) for further instructions.

After letting it run for a few trials (it should take 10 to 15 minutes), we can start analyzing the autotuning data produced by the `mlos_bench` framework.

### 2. Import MLOS packages

In [None]:
# Import mlos_bench Storage API to access the experimental data.
from mlos_bench.storage import from_config

### 3. Connect to the DB using existing mlos_bench configs

We reuse the existing `mlos_bench` framework configuration file that contains the DB connection parameters.
This way we make sure to connect to the same database that our framework uses to store the experimental data.

In [None]:
storage = from_config(config_file="storage/sqlite.jsonc")

### 4. Load the data for our experiment

At the top level, the Storage API has a single property, `.experiments`, that returns a Python `dict` mapping each Experiment ID to its Experiment Data.

In [None]:
storage.experiments

You should see a record for our experiment in the DB. Let's look at the data associated with it.

In [None]:
experiment_id = "sqlite-opt-demo"

### 5. Get all data for one experiment

In [None]:
exp = storage.experiments[experiment_id]
display(exp)

In [None]:
# Display the set of optimization target objectives.
display(exp.objectives)

The main entry point that combines the information about each trial with its configuration parameters and results is the property `.results_df`. It conveniently returns all data about the experiment in a single Pandas DataFrame.

In [None]:
df = exp.results_df

In [None]:
# TODO: Print the first 10 records of the results.

Each record of the DataFrame has the information about the trial, e.g., its timestamp and status, along with the configuration parameters (columns prefixed with `config.`) and the benchmark results (columns prefixed with `result.`). The `trial_id` field is simply the iteration number within the current experiment. Let's look at the first record to see all these fields.

In [None]:
# TODO: Print a single record of the `df` DataFrame
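
The record layout described above can be sketched on a small synthetic DataFrame. This is only an illustration, not the real experiment data: apart from the latency column named in this notebook, the column names (`status`, `config.journal_mode`, `config.cache_size`) and all values are assumptions made up for the example.

```python
import pandas as pd

# Synthetic stand-in for `exp.results_df`. All values are made up, and every
# column name except the latency column is an assumption for illustration.
df_demo = pd.DataFrame(
    {
        "trial_id": [1, 2, 3],
        "status": ["SUCCEEDED", "SUCCEEDED", "FAILED"],
        "config.journal_mode": ["wal", "delete", "memory"],
        "config.cache_size": [2000, 500, 8000],
        "result.90th Percentile Latency (microseconds)": [310.0, 450.0, None],
    }
)

# Split the columns by prefix: inputs (`config.*`) vs. outputs (`result.*`).
config_cols = [c for c in df_demo.columns if c.startswith("config.")]
result_cols = [c for c in df_demo.columns if c.startswith("result.")]

# Selecting one row with `.iloc[0]` yields a Series,
# which prints one field per line and is easier to read than a wide row.
first_record = df_demo.iloc[0]
print(first_record[config_cols])
print(first_record[result_cols])
```

The same prefix filter should work on the real `df`, whatever tunables the experiment actually defines.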

#### 5.1. Look at the data

We can think of each trial as a _noisy_ black-box function that has multiple inputs (that is, `config.*` parameters) and multiple outputs (the `result.*` fields). One of those outputs is designated as a target optimization metric. In our case, it's the DataFrame column named `result.90th Percentile Latency (microseconds)`, but we can reuse other outputs in different experiments (e.g., finding a configuration for maximizing throughput instead of minimizing latency).

The goal of our optimization process is to find input values (that is, the configuration) that minimize the output score, i.e., the 90th percentile query latency. The optimizer repeatedly proposes new input values to efficiently explore the multi-dimensional configuration space and find the (global) optimum.
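
As a rough sketch of what "minimizing the output score" means in tabular terms, we can pick the trial with the lowest observed latency. The data below is synthetic, and `config.cache_size` is an assumed parameter name used only for illustration. Because trials are noisy, the lowest single observation is not necessarily the configuration the optimizer would recommend.

```python
import pandas as pd

TARGET = "result.90th Percentile Latency (microseconds)"

# Synthetic trials; values and the `config.cache_size` column are made up.
df_demo = pd.DataFrame(
    {
        "trial_id": [1, 2, 3, 4],
        "config.cache_size": [500, 2000, 8000, 4000],
        TARGET: [450.0, 310.0, None, 295.0],  # None stands for a trial with no score
    }
)

# Drop trials without a score, then select the row with the smallest target value.
scored = df_demo.dropna(subset=[TARGET])
best = scored.loc[scored[TARGET].idxmin()]
print(best)
```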

Of course, we could blindly trust the optimizer and use the configuration it recommends as the optimum after a reasonably large series of trials. However, it is always a good idea to look at the data from all trials to better understand the behavior of the system and to see how each configuration parameter impacts its performance. Such multi-dimensional data analysis is a daunting task, but looking at one or two dimensions at a time can already reveal a lot of information.

We'll do that in the sections below.

In [None]:
# TODO: Use Pandas API to print a few more records or columns of the data.
# Can you see the correlation between the configuration parameters and the results?
# Neither can we.
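
Raw rows rarely reveal such relationships, so one option is to summarize the target metric one dimension at a time. The sketch below does this on synthetic data: a per-category summary for a categorical parameter and a plain Pearson correlation for a numeric one. The parameter names `config.journal_mode` and `config.cache_size` and all values are assumptions for illustration.

```python
import pandas as pd

TARGET = "result.90th Percentile Latency (microseconds)"

# Synthetic trials; parameter names and values are made up for illustration.
df_demo = pd.DataFrame(
    {
        "config.journal_mode": ["wal", "wal", "delete", "delete", "memory", "memory"],
        "config.cache_size": [500, 4000, 500, 4000, 500, 4000],
        TARGET: [300.0, 280.0, 460.0, 440.0, 350.0, 330.0],
    }
)

# Categorical parameter: summarize the target per category, best median first.
by_mode = (
    df_demo.groupby("config.journal_mode")[TARGET]
    .agg(["count", "median", "min"])
    .sort_values("median")
)
print(by_mode)

# Numeric parameter: a simple (Pearson) correlation with the target.
corr = df_demo["config.cache_size"].corr(df_demo[TARGET])
print(f"correlation(cache_size, latency) = {corr:.2f}")
```

A single correlation coefficient can hide interactions between parameters, so treat it as a hint about where to look rather than a conclusion.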

### 6. Visualize the results data automatically using `mlos_viz.plot()`

In [None]:
import mlos_viz

`mlos_viz` attempts to use the information about the data to automatically provide some basic visualizations without much effort on the developer's part.

At the moment, we do this using [`dabl`](https://github.com/dabl/dabl), though in the future we intend to add support for more interactive visualizations or even scheduling new trials, while maintaining a very simple API:

In [None]:
mlos_viz.plot(exp)

What can we learn from the visualizations?

### 7. Refocusing on a new region of the config space

After examining the results visualized above, you should see that one particular tunable seems to have influenced the results substantially.

What happens if you remove that tunable from the optimizer?

Adjust the configs and re-run the benchmark loop to run that new experiment.

Can we prewarm the optimizer with any of the previous results?

#### Reanalyze the new data

Try using the tabular APIs in addition to the `mlos_viz.plot()` APIs to compare the new and old results.
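
One possible way to do the tabular comparison is to stack both result sets and summarize the target metric per experiment. In the sketch below the two DataFrames are synthetic stand-ins; in practice each would come from the `.results_df` property of the corresponding entry in `storage.experiments`, and the labels `old` and `new` are placeholders chosen for this example.

```python
import pandas as pd

TARGET = "result.90th Percentile Latency (microseconds)"

# Synthetic stand-ins for the results of the original and the re-run experiment.
old_df = pd.DataFrame({"trial_id": [1, 2, 3], TARGET: [450.0, 310.0, 295.0]})
new_df = pd.DataFrame({"trial_id": [1, 2, 3], TARGET: [300.0, 285.0, 290.0]})

# Label each frame, stack them, and summarize the target per experiment.
combined = pd.concat(
    [old_df.assign(experiment="old"), new_df.assign(experiment="new")],
    ignore_index=True,
)
summary = combined.groupby("experiment")[TARGET].agg(["count", "min", "median", "mean"])
print(summary)
```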

### 8. Outro

If you feel curious, please go ahead and play with the SQLite data in the cells below.

After that, please open other notebooks in this repository and explore the data you have collected in this class as well as the results from our MySQL optimization experiments:

* [**mlos_demo_sqlite.ipynb**](mlos_demo_sqlite.ipynb) - Use this notebook to analyze the data you've collected during this workshop.
* [**mlos_demo_mysql.ipynb**](mlos_demo_mysql.ipynb) - Look at the actual production data we've collected in several experiments for MySQL Server optimization on Azure.