# VEDA - Performance of archive run for California 2020

For a relatively large region, how long does it take to run the algorithm for a full year and put all outputs into S3?

In [None]:
# If you haven't installed the fireatlas code yet, uncomment the following line and run this cell.

# !pip install -e .. -q

# After this runs, restart the notebook kernel.

In [1]:
from fireatlas import FireTime
import pandas as pd

import hvplot.pandas

First we need to process the log file to get the timings per section and per function. This function is written specifically for the running.log of an archival run, where the `t` in the output refers to the `t` within `Fire_Forward`.

In [2]:
import ast


def prep_log_df(filepath):
    """Parse an archival-run running.log into per-section and per-function timings."""
    with open(filepath, "r") as f:
        log = f.readlines()

    t = None
    values = []
    section = None
    for line in log:
        # The first 19 characters of each line are the timestamp (YYYY-MM-DD HH:MM:SS).
        if "Starting full run" in line:
            section = "Preprocess t"
            start_t = pd.Timestamp(line[:19])
        if "Done with preprocessing t" in line:
            end_t = pd.Timestamp(line[:19])
            values.append({"t": None, "func": "whole section", "section": section, "took": end_t - start_t})
            start_t = end_t
            section = "Preprocess region + t"
        if "Done with preprocessing region + t" in line:
            end_t = pd.Timestamp(line[:19])
            values.append({"t": None, "func": "whole section", "section": section, "took": end_t - start_t})
            start_t = end_t
            section = "Fire tracking"
        if "Fire tracking at" in line:
            # Parse the bracketed time step; literal_eval avoids running arbitrary code from the log.
            t_parts = line.split("at [")[1].split("]")[0].split(", ")
            t = FireTime.t2dt([ast.literal_eval(part) for part in t_parts])
        if "func:" in line:
            # Timing lines look like "func:<name> took: <value> <unit>".
            func_str, took_str = line.split("func:")[1].split("took: ")
            val_str, unit_str = took_str.split(" ")
            values.append({"t": t, "func": func_str.strip(), "section": section, "took": pd.to_timedelta(float(val_str), unit=unit_str.strip())})
        if "func:Fire_Forward " in line:
            # The Fire_Forward timing line marks the end of the fire tracking section.
            end_t = pd.Timestamp(line[:19])
            values.append({"t": None, "func": "whole section", "section": section, "took": end_t - start_t})
            start_t = end_t
            t = None
            section = "Save outputs"
        if "Done --" in line:
            end_t = pd.Timestamp(line[:19])
            values.append({"t": None, "func": "whole section", "section": section, "took": end_t - start_t})

    log_df = pd.DataFrame(values)
    log_df["sec"] = log_df.took.dt.total_seconds()
    return log_df
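
To make the parsing rules concrete, the sketch below applies the same string operations to a single timing line. The sample line is invented for illustration, so the real running.log layout may differ. The only assumptions carried over from `prep_log_df` are that the first 19 characters hold a timestamp and that timing lines contain `func:<name> took: <value> <unit>`. It uses only the standard library, so `datetime` stands in for `pd.Timestamp`.

```python
from datetime import datetime

# Hypothetical log line, made up to match the pattern prep_log_df expects.
sample = "2024-01-01 12:00:05 | INFO | func:Fire_expand_rtree took: 1.5 sec\n"

# The first 19 characters are the timestamp "YYYY-MM-DD HH:MM:SS".
logged_at = datetime.fromisoformat(sample[:19])

# Everything after "func:" splits into the function name and the duration.
func_str, took_str = sample.split("func:")[1].split("took: ")
val_str, unit_str = took_str.split(" ")

func_name = func_str.strip()
seconds = float(val_str)
unit = unit_str.strip()

print(logged_at, func_name, seconds, unit)
```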

## Section timings

Let's start by exploring the sections to see which part is driving compute time. We have split the timings into "Preprocess t", "Preprocess region + t", "Fire tracking" and "Save outputs".

In [3]:
log_df = prep_log_df("/home/jovyan/fireatlas_nrt/running2.log")
log_df[log_df["func"] == "whole section"]

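|      | t   | func          | section               | took            | sec    |
|------|-----|---------------|-----------------------|-----------------|--------|
| 14   | NaT | whole section | Preprocess t          | 0 days 00:04:03 | 243.0  |
| 2927 | NaT | whole section | Preprocess region + t | 0 days 00:15:22 | 922.0  |
| 8011 | NaT | whole section | Fire tracking         | 0 days 01:22:31 | 4951.0 |
| 8019 | NaT | whole section | Save outputs          | 0 days 00:16:03 | 963.0  |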
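
Summing the four section timings gives a rough answer to the question at the top of the notebook. The snippet below simply redoes that arithmetic with the `sec` values copied from the output above; the dictionary name is illustrative, and the total covers only the sections that appear in the log.

```python
from datetime import timedelta

# Wall-clock seconds per section, copied from the output above.
section_sec = {
    "Preprocess t": 243.0,
    "Preprocess region + t": 922.0,
    "Fire tracking": 4951.0,
    "Save outputs": 963.0,
}

total_sec = sum(section_sec.values())
print("total wall time:", timedelta(seconds=total_sec))

# Share of the total run spent in each section.
shares = {name: sec / total_sec for name, sec in section_sec.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With these numbers the logged sections add up to just under two hours, and fire tracking accounts for roughly 70% of that.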


We can drill in a little more to see which functions within each section are the most time-consuming. Keep in mind that some of these timings come from dask workers. For instance, `preprocess_region_t` adds up to over two hours when summed across workers, but the whole section takes only about 15 minutes of wall time because the work is divided among 12 workers.
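
One rough way to relate the summed worker time to wall time is to divide one by the other. The sketch below uses the `preprocess_region_t` total and the "Preprocess region + t" section time reported in this notebook's outputs, plus the 12 workers mentioned above. Treating the ratio as "effective parallelism" is a simplifying assumption, since the section also includes other functions and scheduling overhead.

```python
# Values taken from this notebook's outputs.
summed_worker_sec = 7652.0  # preprocess_region_t, summed over all calls
section_wall_sec = 922.0    # "Preprocess region + t" whole-section wall time
n_workers = 12              # dask workers used for the run

# How many workers' worth of work ran concurrently, on average.
effective_parallelism = summed_worker_sec / section_wall_sec

# Fraction of the ideal 12x speedup that this ratio represents (rough estimate).
utilization = effective_parallelism / n_workers

print(f"effective parallelism: {effective_parallelism:.1f}x")
print(f"utilization: {utilization:.0%}")
```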

In [4]:
log_df[log_df["func"] != "whole section"].groupby(["section", "func"]).sec.sum().round(0)

section                func                           
Fire tracking          Fire_Forward                       4951.0
                       Fire_Forward_one_step              4917.0
                       Fire_expand_rtree                  2768.0
                       Fire_merge_rtree                    438.0
                       read_preprocessed                    33.0
                       save_allfires_gdf                   324.0
                       save_allpixels                     1005.0
                       update_gdf                          364.0
Preprocess region + t  Fire_merge_rtree                      4.0
                       do_clustering                        29.0
                       find_largefires                       0.0
                       preprocess_region_t                7652.0
                       read_allpixels                        1.0
                       read_preprocessed_input             222.0
                       read_region 

## Time spent within functions of `Fire_Forward`

Since `Fire tracking` was the most time-consuming section, let's aggregate by month and see how long each part of `Fire_Forward` took at different times of the year.

In [5]:
log_df = prep_log_df("/home/jovyan/fireatlas_nrt/running2.log")
# Keep only rows tied to a time step, then sum seconds per function per month.
monthly = log_df.dropna(subset=["t"]).set_index("t").groupby("func")["sec"].resample("ME").sum().reset_index()

monthly[~monthly.func.isin(["Fire_Forward_one_step", "Fire_Forward"])].hvplot.bar(
    y="sec", x="t", by="func", stacked=True, rot=90, grid=True, width=1000, height=500
)

You can see that later in the year, `save_allpixels` starts taking up more time relative to `Fire_expand_rtree`. This is because, in this version of the code, the `allpixels` data is saved at the end of each `t` within `Fire_Forward`. That is an easy place to trim time! We only need to save it once, at the end of `Fire_Forward`.
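
A toy model shows why saving at every step could get more expensive over the year. It assumes, purely for illustration, that `allpixels` grows by a fixed number of rows at each `t` and that every save rewrites the whole table. The function names and numbers are hypothetical and are not fireatlas code.

```python
def rows_written_saving_every_step(n_steps, rows_per_step):
    """Total rows written if the whole accumulated table is saved after each step."""
    total = 0
    accumulated = 0
    for _ in range(n_steps):
        accumulated += rows_per_step  # table grows at each time step
        total += accumulated          # each save rewrites everything so far
    return total


def rows_written_saving_once(n_steps, rows_per_step):
    """Total rows written if the table is saved a single time after the last step."""
    return n_steps * rows_per_step


every_step = rows_written_saving_every_step(n_steps=4, rows_per_step=10)
once = rows_written_saving_once(n_steps=4, rows_per_step=10)
print(every_step, once)
```

Under this model the write volume grows quadratically with the number of steps when saving every step, and only linearly when saving once at the end.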