# Evaluate Baselines

This notebook demonstrates how to evaluate the results of a baseline on a given benchmark.

It is split into two parts. The first part focuses on evaluating a baseline that does not require any training (the `DCApproximationAS`). In the second part, we show how to load a baseline (or any other `AugmentedSimulator`) and evaluate it on a `Benchmark` of our choice.

As in the first notebook, we demonstrate this capability on `NeuripsBenchmark1`.

**NB** This notebook assumes that the data for the benchmark are already available. If they are not, please generate or download them first.

**NB** The `DCApproximationAS` baseline requires the `grid2op` Python package.

#### Import required packages

In [1]:
import pathlib
from pprint import pprint
from lips.benchmark import PowerGridBenchmark

In [2]:
# indicate required paths
LIPS_PATH = pathlib.Path().resolve().parent
DATA_PATH = LIPS_PATH / "reference_data"
LOG_PATH = LIPS_PATH / "lips_logs.log"
CONFIG_PATH = LIPS_PATH / "lips" / "config" / "conf.ini"
BASELINES_PATH = LIPS_PATH / "trained_baselines"

# Benchmark1

## Initial step: load the dataset

A common dataset is used to evaluate the two augmented simulators. This initial step loads it once and for all.

In [3]:
benchmark1 = PowerGridBenchmark(benchmark_name="Benchmark1",
                                benchmark_path=DATA_PATH,
                                load_data_set=True,
                                log_path=LOG_PATH,
                                config_path=CONFIG_PATH
                               )

In [4]:
# to verify the config is loaded appropriately for this benchmark
print("Benchmark name: ", benchmark1.config.section_name)
print("Environment name: ", benchmark1.config.get_option("env_name"))
print("Output attributes: ", benchmark1.config.get_option("attr_y"))
print("Evaluation criteria: ")
pprint(benchmark1.config.get_option("eval_dict"))

Benchmark name:  Benchmark1
Environment name:  l2rpn_case14_sandbox
Output attributes:  ('a_or', 'a_ex')
Evaluation criteria: 
{'IndRed': [],
 'ML': ['MSE_avg', 'MAE_avg', 'mape_90_avg'],
 'OOD': ['MSE_avg', 'MAE_avg', 'mape_90_avg'],
 'Physics': ['CURRENT_POS']}


## The DC approximation

As a reminder, the `grid2op` library is required for this part. You can install it with `pip install grid2op` if you do not already have it.

First, we create the "augmented simulator". Unlike the second model presented in this notebook, this method requires access to a power grid. This is one of the reasons we need grid2op.

The way to load each `AugmentedSimulator` is specific to it. Here, for example, we load the DC approximation, which uses the same power grid as the one used to generate the data in the previous notebook.

In [5]:
# the next few lines are specific for each benchmark and each `AugmentedSimulator`
import grid2op
import warnings
from lips.physical_simulator.dcApproximationAS import DCApproximationAS
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    env = grid2op.make(benchmark1.config.get_option("env_name"), test=True)
    grid_path = pathlib.Path(env.get_path_env()) / "grid.json"

dc_sim = DCApproximationAS(name="dc_approximation", 
                           benchmark_name="Benchmark1",
                           config_path=None, # use default config path
                           grid_path=grid_path)

Now that the model is loaded, a common interface lets us evaluate its performance on a dataset. This is shown in the cell below, where we evaluate the physics-based simulator `DCApproximationAS` on the datasets of the benchmark.

In [None]:
dc_metrics_per_dataset = benchmark1.evaluate_simulator(augmented_simulator=dc_sim,
                                                       dataset="all" # other values : "val", "test", "test_ood_topo"
                                                      )

In [8]:
pprint(dc_metrics_per_dataset)

{'test': {'IndRed': {},
          'ML': {'MAE_avg': {'a_ex': 107.61548189784273,
                             'a_or': 84.97437099584019},
                 'MSE_avg': {'a_ex': 72934.1281314372,
                             'a_or': 49687.964377169934},
                 'mape_90_avg': {'a_ex': 0.14643601581756005,
                                 'a_or': 0.16305198078356278}},
          'Physics': {'CURRENT_POS': {}}},
 'test_ood_topo': {'IndRed': {},
                   'ML': {'MAE_avg': {'a_ex': 115.29069542441187,
                                      'a_or': 90.98358689257782},
                          'MSE_avg': {'a_ex': 84768.24530659153,
                                      'a_or': 57543.84358850798},
                          'mape_90_avg': {'a_ex': 0.16242165476667145,
                                          'a_or': 0.1796451285591881}},
                   'Physics': {'CURRENT_POS': {}}},
 'val': {'IndRed': {},
         'ML': {'MAE_avg': {'a_ex': 106.66064066851729,
          

It is now possible to study the metrics on the different datasets. For example, the "MSE" error on the "test" dataset (whose distribution is similar to the training one) can be read directly from the returned dictionary.
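
As a minimal sketch of how such a lookup might go, the snippet below indexes the nested dictionary by dataset, then metric category, then metric name. To keep it self-contained, it rebuilds a small part of `dc_metrics_per_dataset` from the values printed above; in the notebook itself, the variable returned by `evaluate_simulator` would be used directly.

```python
# Values copied from the "test" entry of the output printed above.
# In the notebook, `dc_metrics_per_dataset` already holds the full dictionary.
dc_metrics_per_dataset = {
    "test": {
        "ML": {
            "MSE_avg": {"a_ex": 72934.1281314372, "a_or": 49687.964377169934},
            "MAE_avg": {"a_ex": 107.61548189784273, "a_or": 84.97437099584019},
        }
    }
}

# dataset -> metric category -> metric name -> one value per output attribute
mse_test = dc_metrics_per_dataset["test"]["ML"]["MSE_avg"]
for attr, value in sorted(mse_test.items()):
    print(f"MSE_avg[{attr}] = {value:.2f}")
```

The same three-level indexing applies to the other datasets (`"val"`, `"test_ood_topo"`) and to the other metric names listed in the evaluation criteria.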

## A learned baseline "augmented simulator"

Along with the datasets, we also provide a baseline (a trained neural network). This baseline is a fully connected neural network that takes the available inputs of the power grid and tries to predict all the outputs of the simulator.

The fully connected neural network is made of XXX layers, each with YYY units.

It is trained for KKK epochs on the training set of `Benchmark1`.

**NB** These baselines are not yet fully trained, and some hyperparameters still need to be optimized. We intend to do that before the official release of the benchmark for the NeurIPS conference.

First, we need to load the baseline and initialize it properly:

In [13]:
from lips.augmented_simulators.fullyConnectedAS import FullyConnectedAS

# rebuild the baseline architecture
fc_augmented_sim = FullyConnectedAS(name="FullyConnectedAS",
                                    benchmark_name="Benchmark1",
                                    log_path=LOG_PATH
                                   )

# TODO create a wrapper for these 3 calls
fc_augmented_sim.load_metadata(BASELINES_PATH)
fc_augmented_sim.init()
fc_augmented_sim.restore(BASELINES_PATH)
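
The wrapper mentioned in the TODO above could look something like the sketch below. The helper name `load_trained_simulator` is hypothetical and not part of LIPS; it only assumes that the simulator exposes the three methods used in the previous cell (`load_metadata`, `init`, `restore`). A small stand-in class is used so the snippet runs without any trained model on disk.

```python
def load_trained_simulator(simulator, path):
    """Hypothetical helper: run the three loading steps in order.

    Assumes `simulator` exposes load_metadata(path), init() and restore(path),
    as in the cell above. Returns the same object for convenience.
    """
    simulator.load_metadata(path)
    simulator.init()
    simulator.restore(path)
    return simulator


class _StubSimulator:
    """Stand-in used only to show the call order; not a LIPS class."""

    def __init__(self):
        self.calls = []

    def load_metadata(self, path):
        self.calls.append(("load_metadata", path))

    def init(self):
        self.calls.append(("init",))

    def restore(self, path):
        self.calls.append(("restore", path))


stub = load_trained_simulator(_StubSimulator(), "trained_baselines")
print([call[0] for call in stub.calls])
```

In the notebook, the call would presumably become `load_trained_simulator(fc_augmented_sim, BASELINES_PATH)`.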

Then, as for the DC approximation, we can evaluate it on the test datasets of the benchmark.

This is done with the same command:

In [14]:
fc_metrics_per_dataset = benchmark1.evaluate_simulator(augmented_simulator=fc_augmented_sim,
                                                       dataset="all",
                                                       batch_size=10000)

## Comparison of the two augmented simulators

### Machine learning metrics 

We can now compare the two "augmented simulators". For example, to compare the MAPE90 (mean absolute percentage error computed on the last 10% quantile) on the test dataset (whose distribution is similar to the training distribution) for the currents (A) at the two extremities of the power lines, we can run:

In [24]:
ML_metrics = "ML"
dataset_name = "test"
print("DC Approximation")
print(f"Dataset : {dataset_name}")
print("{:<10} : {}".format("MAPE90", dc_metrics_per_dataset[dataset_name][ML_metrics]["mape_90_avg"]))
print("{:<10} : {}".format("MSE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MSE_avg"]))
print("{:<10} : {}".format("MAE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))
dataset_name = "test_ood_topo"
print(f"Dataset : {dataset_name}")
print("{:<10} : {}".format("mape_90_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["mape_90_avg"]))
print("{:<10} : {}".format("MSE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MSE_avg"]))
print("{:<10} : {}".format("MAE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

DC Approximation
Dataset : test
MAPE90     : {'a_or': 0.16305198078356278, 'a_ex': 0.14643601581756005}
MSE_avg    : {'a_or': 49687.964377169934, 'a_ex': 72934.1281314372}
MAE_avg    : {'a_or': 84.97437099584019, 'a_ex': 107.61548189784273}
Dataset : test_ood_topo
mape_90_avg : {'a_or': 0.1796451285591881, 'a_ex': 0.16242165476667145}
MSE_avg    : {'a_or': 57543.84358850798, 'a_ex': 84768.24530659153}
MAE_avg    : {'a_or': 90.98358689257782, 'a_ex': 115.29069542441187}
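
To make the MAPE90 metric more concrete, here is a rough sketch of one possible reading of it: the mean absolute percentage error restricted to the samples whose true magnitude lies in the top 10%. The function name `mape_quantile`, the simple threshold rule and the toy data are assumptions made for illustration; the implementation used by the benchmark may differ (for instance in how the quantile is interpolated).

```python
def mape_quantile(y_true, y_pred, quantile=0.9):
    """MAPE over samples whose |true value| is at or above the given quantile.

    Illustrative only: uses a simple rank-based threshold without interpolation
    and assumes the selected true values are nonzero.
    """
    magnitudes = sorted(abs(v) for v in y_true)
    threshold = magnitudes[int(quantile * (len(magnitudes) - 1))]
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if abs(t) >= threshold]
    return sum(abs(t - p) / abs(t) for t, p in pairs) / len(pairs)


# Toy data: ten true currents and predictions that are off by a constant 5 A.
y_true = [10.0 * i for i in range(1, 11)]
y_pred = [v + 5.0 for v in y_true]

result = mape_quantile(y_true, y_pred)
print(f"MAPE on the largest currents: {result:.4f}")
```

Because the error is relative and only the largest values are kept, a constant absolute error weighs less here than it would in a plain MAPE over all samples.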


In [27]:
ML_metrics = "ML"
dataset_name = "test"
print("Fully Connected Augmented Simulator")
print(f"Dataset : {dataset_name}")
print("{:<10} : {}".format("mape_90_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["mape_90_avg"]))
print("{:<10} : {}".format("MSE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MSE_avg"]))
print("{:<10} : {}".format("MAE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))
dataset_name = "test_ood_topo"
print(f"Dataset : {dataset_name}")
print("{:<10} : {}".format("mape_90_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["mape_90_avg"]))
print("{:<10} : {}".format("MSE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MSE_avg"]))
print("{:<10} : {}".format("MAE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

Fully Connected Augmented Simulator
Dataset : test
mape_90_avg : {'a_or': 0.007147388017315423, 'a_ex': 0.006962988795339291}
MSE_avg    : {'a_or': 10.514204978942871, 'a_ex': 23.054763793945312}
MAE_avg    : {'a_or': 1.9447065591812134, 'a_ex': 2.7763805389404297}
Dataset : test_ood_topo
mape_90_avg : {'a_or': 0.19447579884433752, 'a_ex': 0.19415184161617227}
MSE_avg    : {'a_or': 8378.7890625, 'a_ex': 14869.296875}
MAE_avg    : {'a_or': 40.592567443847656, 'a_ex': 56.149513244628906}
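
The two printouts can also be placed side by side. The sketch below does this for `MAE_avg`, using values copied from the outputs above so that it runs on its own; in the notebook, the same numbers would come from `dc_metrics_per_dataset` and `fc_metrics_per_dataset`. The `DC/FC` ratio column is an addition for illustration, not a metric defined by the benchmark.

```python
# MAE_avg values copied from the two printed outputs above.
dc_mae = {
    "test": {"a_or": 84.97437099584019, "a_ex": 107.61548189784273},
    "test_ood_topo": {"a_or": 90.98358689257782, "a_ex": 115.29069542441187},
}
fc_mae = {
    "test": {"a_or": 1.9447065591812134, "a_ex": 2.7763805389404297},
    "test_ood_topo": {"a_or": 40.592567443847656, "a_ex": 56.149513244628906},
}

print(f"{'dataset':<14} {'attr':<5} {'DC':>10} {'FC':>10} {'DC/FC':>8}")
ratios = {}
for dataset in dc_mae:
    for attr in dc_mae[dataset]:
        dc_value = dc_mae[dataset][attr]
        fc_value = fc_mae[dataset][attr]
        # Ratio above 1 means the fully connected model has the lower MAE.
        ratios[(dataset, attr)] = dc_value / fc_value
        print(f"{dataset:<14} {attr:<5} {dc_value:>10.2f} {fc_value:>10.2f} "
              f"{ratios[(dataset, attr)]:>8.1f}")
```

On these numbers, the fully connected model has a lower MAE than the DC approximation on both datasets, but its advantage shrinks considerably on `test_ood_topo`.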


### Physics compliance

In [30]:
physic_compliances = "Physics"
dataset_name = "test"
current_violation = fc_metrics_per_dataset[dataset_name][physic_compliances]["CURRENT_POS"]["a_or"]["Violation_proportion"]
print("{:.2f}% of currents at the origin side of power lines violate the current positivity.".format(current_violation*100))

2.69% of currents at the origin side of power lines violate the current positivity.


In [31]:
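# NOTE: this line reads "Violation_proportion" again, so the value printed below is the
# violation proportion (0.0269, rounded to 0.03), not a sum of currents in amperes.
current_error = fc_metrics_per_dataset[dataset_name][physic_compliances]["CURRENT_POS"]["a_or"]["Violation_proportion"]
print("The sum of negative current values (Amp) : {:.2f}".format(current_error))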

The sum of negative current values (Amp) : 0.03
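
For intuition, a current positivity check of this kind could be computed along the following lines. This is only a sketch under stated assumptions, not the benchmark's implementation: the function name `current_positivity_check` is hypothetical and the input values are invented for illustration.

```python
def current_positivity_check(currents):
    """Return (proportion of negative values, sum of their magnitudes)."""
    negatives = [c for c in currents if c < 0]
    proportion = len(negatives) / len(currents)
    total_magnitude = sum(-c for c in negatives)
    return proportion, total_magnitude


# Illustrative predicted currents in amperes (invented, not benchmark data).
predicted_a_or = [120.5, -0.5, 300.0, 15.2, -1.5, 87.0, 42.0, 10.0]

proportion, total = current_positivity_check(predicted_a_or)
print("{:.2f}% of predicted currents are negative".format(proportion * 100))
print("Sum of negative current magnitudes (A): {:.2f}".format(total))
```

A physical solver never produces negative current magnitudes, so any nonzero proportion here points to predictions that are not physically plausible.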


### Industrial readiness

In [32]:
fc_augmented_sim.predict_time

0.13227033615112305

In [33]:
dc_sim._predict_time

262.6392412185669

In [34]:
dc_sim._raw_grid_simulator.comp_time

138.35747891198844
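
The three timings above can be turned into a rough speed comparison. The sketch below copies the printed values and assumes they are expressed in the same unit and measured on the same evaluation run. It also assumes, from the attribute name only, that `comp_time` is the part of the DC prediction time spent inside the underlying grid simulator.

```python
# Values copied from the three cell outputs above.
fc_predict_time = 0.13227033615112305  # fc_augmented_sim.predict_time
dc_predict_time = 262.6392412185669    # dc_sim._predict_time
dc_solver_time = 138.35747891198844    # dc_sim._raw_grid_simulator.comp_time

# How many times faster the learned model is, on these numbers.
speed_up = dc_predict_time / fc_predict_time
# Assumed interpretation: fraction of the DC time spent in the raw solver.
solver_share = dc_solver_time / dc_predict_time

print(f"Speed-up of the fully connected model over the DC approximation: x{speed_up:.0f}")
print(f"Share of the DC time spent in the raw grid simulator: {solver_share:.0%}")
```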

# Benchmark2

In [24]:
import os
from lips.benchmark import PowerGridBenchmark

path_benchmark = os.path.join("reference_data")
log_path = os.path.abspath(os.path.join("lips", "logger", "logs.log"))
# keyword is `benchmark_path`, as in the Benchmark1 call earlier in this notebook
benchmark2 = PowerGridBenchmark(benchmark_name="Benchmark2",
                                benchmark_path=path_benchmark,
                                load_data_set=True,
                                log_path=log_path
                               )

# Benchmark3 