# The Bureaucrat

This notebook presents [The Bureaucrat](https://github.com/SengerM/the_bureaucrat), a package that helps you deal with the bureaucracy of storing your data consistently in a directory structure, in an automated and scalable way.

You'll need to install a few things:
```
pip install git+https://github.com/SengerM/the_bureaucrat
pip install plotly
pip install pandas
```

In [None]:
# To install things just run this cell.
%pip install git+https://github.com/SengerM/the_bureaucrat
%pip install plotly
%pip install pandas

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
%autosave 0

from the_bureaucrat.bureaucrats import RunBureaucrat # pip install git+https://github.com/SengerM/the_bureaucrat
import plotly.express as px # pip install plotly
import pandas # pip install pandas
from pathlib import Path
import datetime
import time
import warnings
import shutil

warnings.filterwarnings("ignore", message="Cannot create backup of script*")

def unique_timestamp():
    time.sleep(1) # This is to ensure that no two timestamps are the same.
    return datetime.datetime.now().strftime("%Y%m%d%H%M%S")

STORE_DATA_HERE_PATH = Path.home()/'deleteme'
def clean_runs():
    for p in STORE_DATA_HERE_PATH.iterdir():
        shutil.rmtree(p)

# Introduction

- This package was designed with a tree-like structure in mind.
- It defines two kinds of objects:
  1. **Run**: Each node of the tree.
  2. **Task**: Things that you do within each run.
    - Tasks can have *sub-runs*; this is where the tree structure arises.
- In the end, these objects are just directories in the file system.
- In the code
  - **Run**s are handled by `RunBureaucrat` objects.
  - **Task**s are handled by `TaskBureaucrat` objects.
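
As a rough illustration of the bullets above, the sketch below builds such a tree out of plain directories with `pathlib` and prints it. The names `toy_run`, `measure`, `toy_run_A0` and `measure_once`, as well as the exact nesting, are made up for this example; the layout that The Bureaucrat actually writes to disk may differ in its details.

```python
import tempfile
from pathlib import Path

def build_toy_tree(root: Path) -> Path:
    """Create run -> task -> sub-runs -> task as plain directories (hypothetical layout)."""
    run = root / 'toy_run'
    task = run / 'measure'
    for subrun_name in ['toy_run_A0', 'toy_run_A1']:
        (task / subrun_name / 'measure_once').mkdir(parents=True)
    return run

def render_tree(directory: Path, indent: int = 0) -> list:
    """Return one indented line per directory, depth first."""
    lines = ['  ' * indent + directory.name]
    for child in sorted(p for p in directory.iterdir() if p.is_dir()):
        lines.extend(render_tree(child, indent + 1))
    return lines

with tempfile.TemporaryDirectory() as tmp:
    tree_lines = render_tree(build_toy_tree(Path(tmp)))

print('\n'.join(tree_lines))
```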

In an image:

<img src="img/diagram.svg" align="left">

## Toy system to work with

Let's consider a toy example with a black box that has two inputs and one output:

<img src="img/black_box.svg" align="left">

In [None]:
from black_box import measure_black_box

The prototype of this toy function is:
```python
def measure_black_box(A:float, B:float)->float: ...
```
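
The `black_box` module is not reproduced in this notebook. If you do not have it at hand, a stand-in with the same prototype is enough to follow along. The one below is purely hypothetical: it assumes the output is a Gaussian random number whose mean is `A` and whose spread grows with `B`, which is not necessarily what the real black box does.

```python
import random

def measure_black_box(A: float, B: float) -> float:
    """Hypothetical stand-in: Gaussian noise with mean A and standard deviation 1 + |B|."""
    return random.gauss(A, 1 + abs(B))

random.seed(0)  # Only to make this demonstration reproducible.
samples = [measure_black_box(A=2, B=0.5) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
print(f'Mean of {len(samples)} samples: {sample_mean:.2f}')
```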

# Basic usage

First, create a `RunBureaucrat`. It is very simple:
```python
my_bureaucrat = RunBureaucrat(path_to_the_run)
```
The last part of `path_to_the_run` will be assumed to be the *name of the run*.
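
In `pathlib` terms, "the last part" of a path is its `name` attribute. The snippet below only illustrates that convention with a made-up path; it does not use the package and creates nothing on disk.

```python
from pathlib import Path

example_path = Path('some') / 'where' / 'my_first_run'  # Made-up path, nothing is created.
example_run_name = example_path.name      # Last component of the path.
example_parent = example_path.parent      # Everything before it.
print(example_run_name)
print(example_parent.as_posix())
```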

In [None]:
run_name = 'your_favourite_name'
John = RunBureaucrat(STORE_DATA_HERE_PATH/run_name)

The previous cell created a `RunBureaucrat` object in Python, but nothing else was done: no directory has been created yet.

In [None]:
John.create_run()
print(f"The run was created in {John.path_to_run_directory}")

A *run directory* was created. We know it is a *run directory* because there is a file `bureaucrat_run_info.txt`.
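
The marker-file idea can be sketched in a few lines. The helper `looks_like_run_directory` below is hypothetical and not part of the package; it only assumes, as stated above, that a run directory contains a file called `bureaucrat_run_info.txt`.

```python
import tempfile
from pathlib import Path

RUN_MARKER = 'bureaucrat_run_info.txt'  # File name taken from the text above.

def looks_like_run_directory(path: Path) -> bool:
    """Hypothetical check: a directory counts as a run if it contains the marker file."""
    return path.is_dir() and (path / RUN_MARKER).is_file()

with tempfile.TemporaryDirectory() as tmp:
    plain_directory = Path(tmp) / 'just_a_folder'
    plain_directory.mkdir()

    fake_run = Path(tmp) / 'fake_run'
    fake_run.mkdir()
    (fake_run / RUN_MARKER).touch()  # Emulate the marker of a created run.

    is_plain_a_run = looks_like_run_directory(plain_directory)
    is_fake_a_run = looks_like_run_directory(fake_run)

print(is_plain_a_run, is_fake_a_run)
```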

Now we can perform a *Task* within this run. We ask `John` to hand it to one of his employees:

In [None]:
A = 0
B = 0
with John.handle_task('measure_with_constant_A_and_B') as Johns_employee:
    measurements_result = [measure_black_box(A=A,B=B) for n in range(99)]
    
    measurements_df = pandas.DataFrame({'black_box_output': measurements_result})
    measurements_df['A'] = A
    measurements_df['B'] = B
    
    measurements_df.to_csv(Johns_employee.path_to_directory_of_my_task/'data.csv', index=False)

print(f'Results of this task can be found in {Johns_employee.path_to_directory_of_my_task}')

In the previous cell:
- `John` is a `RunBureaucrat` managing the *Run*, he is the boss.
- `Johns_employee` is a `TaskBureaucrat` managing a *Task* within the *Run*, he works for `John`.
- The `with` statement ensures that
  1. A directory for the *Task* is created within the *Run*.
  2. If any exception happens during the handling of this task, you (and other *bureaucrats*) will know it in the future.
  3. If the *Task* is completed successfully, you (and other *bureaucrats*) will know it.
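
To make the three points above concrete, here is a minimal sketch of a context manager with the same observable behaviour, written with the standard library only. It is not the package's implementation: the names `handle_task_sketch` and `task_was_successful` and the marker file `.task_ok` are all made up for this illustration.

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path

SUCCESS_MARKER = '.task_ok'  # Made-up file name, only for this sketch.

@contextmanager
def handle_task_sketch(run_directory: Path, task_name: str):
    task_directory = run_directory / task_name
    task_directory.mkdir(parents=True, exist_ok=True)  # 1. The task directory is created.
    (task_directory / SUCCESS_MARKER).unlink(missing_ok=True)  # Forget any previous success.
    yield task_directory  # The body of the `with` runs here; an exception skips the next line.
    (task_directory / SUCCESS_MARKER).touch()  # 3. Reached only if the body succeeded.

def task_was_successful(run_directory: Path, task_name: str) -> bool:
    return (run_directory / task_name / SUCCESS_MARKER).is_file()

with tempfile.TemporaryDirectory() as tmp:
    run_directory = Path(tmp)

    with handle_task_sketch(run_directory, 'good_task') as task_directory:
        (task_directory / 'data.txt').write_text('42')

    try:
        with handle_task_sketch(run_directory, 'bad_task'):
            raise RuntimeError('Something went wrong while measuring')
    except RuntimeError:
        pass  # 2. The error propagated and no success marker was written.

    good_ok = task_was_successful(run_directory, 'good_task')
    bad_ok = task_was_successful(run_directory, 'bad_task')

print(f'good_task successful: {good_ok}, bad_task successful: {bad_ok}')
```

A later step can then call something like `task_was_successful` before reading the data, which is the role `check_these_tasks_were_run_successfully` plays in the cells of this notebook.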

Let's now plot this data, with a new task:

In [None]:
John.check_these_tasks_were_run_successfully(['measure_with_constant_A_and_B']) # If not, this raises an error.

with John.handle_task('plot_at_constant_A_and_B') as Johns_employee:
    measured_data_df = pandas.read_csv(Johns_employee.path_to_directory_of_task('measure_with_constant_A_and_B')/'data.csv')
    fig = px.histogram(measured_data_df, x='black_box_output')
    fig.write_html(str(Johns_employee.path_to_directory_of_my_task/'plot.html'))

print(f'Results of this task can be found in {Johns_employee.path_to_directory_of_my_task}')

Note how paths to files are specified: **We always ask the respective bureaucrat for the path to somewhere** and we only add the last part.

## What happens if there is an error during a task?

Go to the cell doing the task `'measure_with_constant_A_and_B'` and purposely introduce any error inside the `with`. Then try to execute the cell with the task `'plot_at_constant_A_and_B'` and see what happens.

In [None]:
clean_runs()

# Deeper example

Let's see now how the tree-like structure can be exploited.

In [None]:
def measure_with_constant_A_and_B(bureaucrat:RunBureaucrat, A:float, B:float, n_measurements:int):
    Alberto = bureaucrat # Let's give him a proper name.
    with Alberto.handle_task('measure_with_constant_A_and_B') as Albertos_employee:
        measurements_result = [measure_black_box(A=A,B=B) for n in range(n_measurements)]

        measurements_df = pandas.DataFrame({'black_box_output': measurements_result})
        measurements_df['A'] = A
        measurements_df['B'] = B

        measurements_df.to_csv(Albertos_employee.path_to_directory_of_my_task/'data.csv', index=False)

def measure_sweeping_A(bureaucrat:RunBureaucrat, As:list, B:float, n_measurements:int):
    Natalia = bureaucrat # We give her a proper name.
    with Natalia.handle_task('measure_sweeping_A') as Natalias_employee:
        for A in As:
            boss_in_sub_office = Natalias_employee.create_subrun(
                subrun_name = f'{Natalias_employee.run_name}_A{A}'
            )
            boss_in_sub_office.create_run()
            measure_with_constant_A_and_B(
                bureaucrat = boss_in_sub_office,
                A = A,
                B = B,
                n_measurements = n_measurements,
            )

I have defined two functions:
1. One to measure with constant `A` and `B`.
2. One that calls the other while sweeping `A`.

In [None]:
Matthias = RunBureaucrat(STORE_DATA_HERE_PATH/'my_run')
Matthias.create_run()
measure_sweeping_A(
    bureaucrat = Matthias,
    As = [0,1,2,3],
    B = .8,
    n_measurements = 99,
)
print(f'Find the results in {Matthias.path_to_directory_of_task("measure_sweeping_A")}')

Now let's process (well, actually just plot) the data exploiting the tree structure.

In [None]:
def read_data_constant_A_and_B(bureaucrat:RunBureaucrat):
    Pepe = bureaucrat
    Pepe.check_these_tasks_were_run_successfully(['measure_with_constant_A_and_B'])
    return pandas.read_csv(Pepe.path_to_directory_of_task('measure_with_constant_A_and_B')/'data.csv')

def read_data_sweeping_A(bureaucrat:RunBureaucrat):
    Nahuel = bureaucrat
    Nahuel.check_these_tasks_were_run_successfully(['measure_sweeping_A'])
    loaded_data = []
    for subrun_bureaucrat in Nahuel.list_subruns_of_task('measure_sweeping_A'):
        df = read_data_constant_A_and_B(subrun_bureaucrat)
        loaded_data.append(df)
    return pandas.concat(loaded_data)

def plot_sweeping_A(bureaucrat:RunBureaucrat):
    Marta = bureaucrat
    Marta.check_these_tasks_were_run_successfully(['measure_sweeping_A'])
    data_df = read_data_sweeping_A(Marta)
    with Marta.handle_task('plot_sweeping_A') as Martas_employee:
        fig = px.ecdf(data_df, x='black_box_output', color='A')
        fig.write_html(Martas_employee.path_to_directory_of_my_task/'plot.html')

Note how each function just receives a `RunBureaucrat` and it has everything it needs to do its job.
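
The same pattern can be seen without the package: if every function receives a single handle to its node of the tree, it can discover its children on its own. In the sketch below a plain `pathlib.Path` plays the role of the `RunBureaucrat`; the helper names and the `data.txt` file are made up for this illustration.

```python
import tempfile
from pathlib import Path

def measure_leaf(directory: Path, A: float) -> None:
    directory.mkdir(parents=True)
    (directory / 'data.txt').write_text(str(A))  # Made-up file holding a single number.

def measure_sweep(directory: Path, As: list) -> None:
    for A in As:
        measure_leaf(directory / f'A{A}', A)

def read_leaf(directory: Path) -> float:
    return float((directory / 'data.txt').read_text())

def read_sweep(directory: Path) -> list:
    # Only `directory` is needed: the children are discovered from the file system.
    return [read_leaf(child) for child in sorted(directory.iterdir())]

with tempfile.TemporaryDirectory() as tmp:
    sweep_directory = Path(tmp) / 'toy_sweep'
    measure_sweep(sweep_directory, As=[0, 1, 2, 3])
    values = read_sweep(sweep_directory)

print(values)
```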

In [None]:
plot_sweeping_A(Matthias)
print(f'Find the results in {Matthias.path_to_directory_of_task("plot_sweeping_A")}')

In [None]:
clean_runs()

## Yet more complexity, trivially

Let's now sweep `B` too. It is very easy because we already have the functions that sweep `A`, so we just write a set of functions that sweep `B` and call the previous functions that will sweep `A`:

In [None]:
def measure_sweeping_A_and_B(bureaucrat:RunBureaucrat, As:list, Bs:list, n_measurements:int):
    Natalia = bureaucrat # We give her a proper name.
    with Natalia.handle_task('measure_sweeping_A_and_B') as Natalias_employee:
        for B in Bs:
            boss_in_sub_office = Natalias_employee.create_subrun(
                subrun_name = f'{Natalias_employee.run_name}_B{B}'
            )
            boss_in_sub_office.create_run()
            measure_sweeping_A(
                bureaucrat = boss_in_sub_office,
                As = As,
                B = B,
                n_measurements = n_measurements,
            )

def read_data_sweeping_A_and_B(bureaucrat:RunBureaucrat):
    Nahuel = bureaucrat
    Nahuel.check_these_tasks_were_run_successfully(['measure_sweeping_A_and_B'])
    loaded_data = []
    for subrun_bureaucrat in Nahuel.list_subruns_of_task('measure_sweeping_A_and_B'):
        df = read_data_sweeping_A(subrun_bureaucrat)
        loaded_data.append(df)
    return pandas.concat(loaded_data)

def plot_sweeping_A_and_B(bureaucrat:RunBureaucrat):
    Celestino = bureaucrat
    Celestino.check_these_tasks_were_run_successfully(['measure_sweeping_A_and_B'])
    with Celestino.handle_task('plot_sweeping_A_and_B') as Celestinos_employee:
        data_df = read_data_sweeping_A_and_B(Celestino)
        fig = px.ecdf(
            data_df,
            x = 'black_box_output',
            color = 'A',
            facet_col = 'B',
        )
        fig.write_html(Celestinos_employee.path_to_directory_of_my_task/'plot.html')

In [None]:
Matthias = RunBureaucrat(STORE_DATA_HERE_PATH/'run_sweeping_both')
Matthias.create_run()
measure_sweeping_A_and_B(
    bureaucrat = Matthias,
    As = [0,1,2,3],
    Bs = [0,.8,1],
    n_measurements = 99,
)
print(f'Find the results in {Matthias.path_to_directory_of_task("measure_sweeping_A_and_B")}')

In [None]:
plot_sweeping_A_and_B(Matthias)
print(f'Find the results in {Matthias.path_to_directory_of_task("plot_sweeping_A_and_B")}')

In [None]:
clean_runs()

# Other features

## Warnings when weird characters are found

It will warn you if your paths contain characters that are best avoided for cross-platform compatibility:

In [None]:
Ludwig = RunBureaucrat(STORE_DATA_HERE_PATH/'I have spaces and also + and ? and *')
Ludwig.create_run()
print(f'Run was created in {Ludwig.path_to_run_directory}')

On Windows, `*` will certainly cause problems, so Ludwig is warning us.
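
A check of this kind can be sketched as follows. The set of "safe" characters used here (ASCII letters, digits, `_`, `-` and `.`) is an assumption made for this example and is not necessarily the rule the package applies; the helper name `warn_if_risky_name` is hypothetical as well.

```python
import re
import warnings

SAFE_CHARACTER = r'[A-Za-z0-9_.\-]'  # Assumed whitelist, for illustration only.

def warn_if_risky_name(name: str) -> bool:
    """Warn and return True if `name` contains characters outside the whitelist."""
    if re.fullmatch(SAFE_CHARACTER + '+', name):
        return False
    risky = sorted(set(re.sub(SAFE_CHARACTER, '', name)))
    warnings.warn(f'{name!r} contains characters that may not be portable: {risky}')
    return True

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    safe_is_risky = warn_if_risky_name('my_run_2024')
    ludwig_is_risky = warn_if_risky_name('I have spaces and also + and ? and *')

print(safe_is_risky, ludwig_is_risky, len(caught))
```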

## Automatic script backups

Each time a task is performed, e.g. `with John.handle_task('task_name') as John_employee:`, a backup of the script file is placed in the task folder, so you can later look up how things were done.

This feature can be disabled:
```python
with John.handle_task('task_name', backup_this_python_file=False) as John_employee:
    # Do whatever you want here.
```

*Note*: At the time of writing this, this feature does not work with Jupyter Notebooks, but it does with regular scripts.
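
The idea behind such a backup can be sketched with `shutil`. The helper `backup_script` and the `backup.` file-name prefix below are hypothetical, chosen only for this illustration; the package may name and place the copy differently.

```python
import shutil
import tempfile
from pathlib import Path

def backup_script(script_path: Path, task_directory: Path) -> Path:
    """Copy `script_path` into `task_directory` and return the path of the copy."""
    task_directory.mkdir(parents=True, exist_ok=True)
    destination = task_directory / f'backup.{script_path.name}'
    shutil.copy2(script_path, destination)  # copy2 also preserves the modification time.
    return destination

with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / 'analysis.py'
    script.write_text("print('hello')\n")

    backup_copy = backup_script(script, Path(tmp) / 'my_run' / 'my_task')
    backup_name = backup_copy.name
    backup_content = backup_copy.read_text()

print(backup_name)
```

In a regular script one would typically pass `Path(__file__)` as `script_path`. Inside a Jupyter notebook `__file__` is normally not defined, which may be related to the limitation mentioned in the note above.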

## Automatic deletion of old data

When a task is performed, by default the old data (if any) is deleted from the task directory. This ensures that when the task finishes only fresh data is present.

This behavior can be changed, e.g. if you want to append data, with:
```python
with John.handle_task('task_name', drop_old_data=False) as John_employee:
    # Do whatever you want here.
```
By default `drop_old_data=True`.
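
The difference between the two modes can be sketched as follows. `prepare_task_directory` is a hypothetical helper written for this example; only the parameter name `drop_old_data` and its default are taken from the text above.

```python
import shutil
import tempfile
from pathlib import Path

def prepare_task_directory(task_directory: Path, drop_old_data: bool = True) -> Path:
    """Return an existing task directory, emptied first unless drop_old_data is False."""
    if drop_old_data and task_directory.exists():
        shutil.rmtree(task_directory)
    task_directory.mkdir(parents=True, exist_ok=True)
    return task_directory

with tempfile.TemporaryDirectory() as tmp:
    task_directory = Path(tmp) / 'my_task'

    # First execution, then a second one that keeps the old data.
    (prepare_task_directory(task_directory) / 'first.txt').touch()
    (prepare_task_directory(task_directory, drop_old_data=False) / 'second.txt').touch()
    files_when_appending = sorted(p.name for p in task_directory.iterdir())

    # Third execution with the default: old files are removed first.
    (prepare_task_directory(task_directory) / 'third.txt').touch()
    files_after_fresh_start = sorted(p.name for p in task_directory.iterdir())

print(files_when_appending)
print(files_after_fresh_start)
```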

# Conclusions

- [The Bureaucrat](https://github.com/SengerM/the_bureaucrat) provides a *pure Python*, cross-platform mechanism to accommodate data and results in a tree-like structure.
- Very flexible: It just creates a directory structure in the file system; you can then store whatever you want (automatically or manually), copy-paste parts of the tree, share, etc.
- The underlying directory structure is encoded into *Task*s that are done within *Run*s.
- Scalable: Thanks to its tree-oriented design, it is very easy to scale things up, as shown in the examples.

In [None]:
shutil.rmtree(STORE_DATA_HERE_PATH)