# Evaluate Baselines

The goal of this notebook is to demonstrate how we can evaluate the results of a baseline on a given benchmark.

It is split into two parts. The first part focuses on the evaluation of a baseline that does not require any training (`DCApproximationAS`). In the second part, we show how to load a baseline (or any other `AugmentedSimulator`) and evaluate it on a `Benchmark` of our choice.

As in the first notebook, we demonstrate this capability on `NeuripsBenchmark1`.

**NB** This notebook assumes that the benchmark data are already available. If they are not, please generate or download them first.

**NB** The `DCApproximationAS` baseline requires the `grid2op` Python package.

#### Import required packages

In [1]:
import os
from lips.benchmark import PowerGridBenchmark

# Benchmark1

## Initial step: load the dataset

A common dataset is used to evaluate both augmented simulators. This initial step loads it once and for all.

In [2]:
benchmark_path = os.path.abspath(os.path.join(os.path.pardir, "reference_data"))
log_path = os.path.abspath("logs.log")
benchmark1 = PowerGridBenchmark(benchmark_name="Benchmark1",
                                benchmark_path=benchmark_path,
                                load_data_set=True,
                                log_path=log_path
                               )

## The DC approximation

As a reminder, the `grid2op` library is required for this part. If it is not installed yet, you can install it with `pip install grid2op`.

First, we create the "augmented simulator". Unlike the second model presented here, this method requires access to a power grid model, which is one of the reasons grid2op is needed.
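To see why a grid model is needed, here is a minimal sketch of a textbook DC power flow written with NumPy only. It is an illustration, not the LIPS or grid2op implementation: it assumes flat voltage magnitudes (1 p.u.), small angle differences and negligible line resistance, and the function `dc_power_flow` and the toy 3-bus grid are hypothetical. The line reactances and the way buses are connected come from the grid description, which is exactly what the simulator has to load.

```python
import numpy as np

def dc_power_flow(n_bus, lines, p_injection, slack=0):
    """Textbook DC power flow (hypothetical helper, not the LIPS API).

    lines: list of (from_bus, to_bus, reactance_pu)
    p_injection: net active power injected at each bus (p.u.), summing to zero
    Returns the bus voltage angles (rad) and the active power flow on each line (p.u.).
    """
    # Build the nodal susceptance matrix B from the line reactances
    b_matrix = np.zeros((n_bus, n_bus))
    for f, t, x in lines:
        b = 1.0 / x
        b_matrix[f, f] += b
        b_matrix[t, t] += b
        b_matrix[f, t] -= b
        b_matrix[t, f] -= b

    # The slack bus angle is fixed to 0: drop its row and column, then solve P = B theta
    others = [i for i in range(n_bus) if i != slack]
    p = np.asarray(p_injection, dtype=float)
    theta = np.zeros(n_bus)
    theta[others] = np.linalg.solve(b_matrix[np.ix_(others, others)], p[others])

    # The flow on each line is the angle difference divided by the line reactance
    flows = np.array([(theta[f] - theta[t]) / x for f, t, x in lines])
    return theta, flows

# Toy 3-bus triangle: bus 0 injects 1 p.u. and bus 2 withdraws it
lines = [(0, 1, 0.1), (1, 2, 0.1), (0, 2, 0.1)]
theta, flows = dc_power_flow(3, lines, [1.0, 0.0, -1.0])
print(flows)
```

The direct line 0-2 carries twice the flow of the indirect path 0-1-2, since the indirect path has twice the total reactance. Turning these active power flows into the currents in amperes compared later in this notebook also requires voltage levels; the sketch stops at the flows.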

Each `AugmentedSimulator` has its own loading procedure. Here, for example, we load the DC approximation with the same power grid as the one used to generate the data in the previous notebook.

In [3]:
# the next few lines are specific for each benchmark and each `AugmentedSimulator`
import grid2op
import warnings
from lips.physical_simulator.dcApproximationAS import DCApproximationAS
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    env = grid2op.make("l2rpn_case14_sandbox", test=True)
    grid_path = os.path.join(env.get_path_env(), "grid.json")

dc_sim = DCApproximationAS(name="dc_approximation", 
                           benchmark_name="Benchmark1",
                           path_config=None, # use default config path
                           grid_path=grid_path)

Now that the model is loaded, a common interface can be used to evaluate its performance on a dataset. The cell below shows this for the physics-based simulator `DCApproximationAS`.

In [4]:
dc_metrics_per_dataset = benchmark1.evaluate_simulator(augmented_simulator=dc_sim,
                                                       dataset="all"  # "test" alone would not produce the "test_ood_topo" results used below
                                                      )

evaluate dc: 100%|██████████| 10000/10000 [08:16<00:00, 20.14it/s]


It is then possible to study the metrics on each dataset, for instance on the "test" dataset, whose distribution is similar to the training one. This is done in the comparison section below.
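Judging from the comparison cells later in this notebook, the object returned by `evaluate_simulator` is indexed first by dataset name, then by a category index (`0` for machine learning metrics, `1` for physics compliances), then by metric name and output variable. Assuming that layout, a small hypothetical helper (not part of LIPS) can collect one metric across all datasets. The example values are the DC approximation MAPE90 values printed further down; the rest of the nesting is a stand-in.

```python
ML_METRICS = 0  # index used in this notebook for the machine learning metrics

def metric_table(metrics_per_dataset, metric, category=ML_METRICS):
    """Hypothetical helper: return {dataset: {variable: value}} for a single metric."""
    return {name: results[category][metric] for name, results in metrics_per_dataset.items()}

# Stand-in with the assumed nesting (only the MAPE90 entries are filled in)
dc_like = {
    "test": [{"mape90": {"a_or": 0.16308450692718962, "a_ex": 0.1464899336666497}}, {}],
    "test_ood_topo": [{"mape90": {"a_or": 0.17918059033390996, "a_ex": 0.16180186468334115}}, {}],
}

for dataset, values in metric_table(dc_like, "mape90").items():
    print(f"{dataset:<15} a_or={values['a_or']:.3f}  a_ex={values['a_ex']:.3f}")
```

If the real results follow the same nesting, `metric_table(dc_metrics_per_dataset, "mape90")` should work the same way.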

## A learned baseline "augmented simulator"

Along with the datasets, we also provide some baselines (trained neural networks). The baseline used here is a fully connected neural network that takes the available inputs of the power grid and tries to predict all the outputs of the simulator.

The fully connected neural network is made of XXX layers, each with YYY units.

It is trained for KKK epochs on the training set of `Benchmark1`.
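A fully connected network of this kind is a stack of affine layers separated by nonlinear activations. The sketch below shows the forward pass in NumPy. Since the actual number of layers (XXX), units per layer (YYY) and activation function are not given here, the layer sizes, the ReLU activation and the He-style initialisation are assumptions, and `init_mlp` and `mlp_forward` are hypothetical functions, not part of `FullyConnectedAS`.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layer sizes (He-style initialisation, assumed)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        bias = np.zeros(n_out)
        params.append((weights, bias))
    return params

def mlp_forward(params, x):
    """Affine layer + ReLU for hidden layers, plain affine output layer (regression)."""
    h = np.asarray(x, dtype=float)
    for i, (weights, bias) in enumerate(params):
        h = h @ weights + bias
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

# Placeholder sizes: 10 grid inputs, two hidden layers of 64 units, 4 predicted outputs
params = init_mlp([10, 64, 64, 4])
x = np.random.default_rng(1).normal(size=(8, 10))  # a batch of 8 toy input vectors
y = mlp_forward(params, x)
print(y.shape)
```

Keeping the output layer linear lets the network predict unbounded quantities such as currents; note that nothing in this architecture forces the predicted currents to be positive, which is relevant to the physics compliance checks later on.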

**NB** These baselines are not yet fully trained, and some hyperparameters still need to be optimized. We intend to do so before the official release of the benchmark for the NeurIPS conference.

First, we need to load the baseline and initialize it properly:

In [None]:
path_baselines = os.path.join("trained_baselines")
from lips.augmented_simulators.fullyConnectedAS import FullyConnectedAS

# recreate the baseline
fc_augmented_sim = FullyConnectedAS(name="FullyConnectedAS",
                                    benchmark_name="Benchmark1",
                                    log_path=log_path
                                   )

# TODO create a wrapper for these 3 calls
fc_augmented_sim.load_metadata(path_baselines)
fc_augmented_sim.init()
fc_augmented_sim.restore(path_baselines)
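The `TODO` above asks for a wrapper around these three calls. A minimal sketch, assuming the methods keep the names, arguments and order used in the cell; `restore_baseline` is a hypothetical name, the roles given in the comments are assumptions, and `_CallRecorder` is only a stand-in used to check the call order without LIPS installed.

```python
def restore_baseline(simulator, path):
    """Hypothetical helper bundling the three calls from the cell above, in the same order."""
    simulator.load_metadata(path)  # assumed: read the metadata saved with the baseline
    simulator.init()               # assumed: build the model from that metadata
    simulator.restore(path)        # assumed: load the trained parameters
    return simulator

class _CallRecorder:
    """Stand-in exposing the same three methods; it only records how they are called."""
    def __init__(self):
        self.calls = []

    def load_metadata(self, path):
        self.calls.append(("load_metadata", path))

    def init(self):
        self.calls.append(("init",))

    def restore(self, path):
        self.calls.append(("restore", path))

recorder = restore_baseline(_CallRecorder(), "trained_baselines")
print(recorder.calls)
```

With LIPS, the equivalent call would be `restore_baseline(fc_augmented_sim, path_baselines)`.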

Then, as for the DC approximation, we can evaluate it on the test datasets of the benchmark with the same command:

In [None]:
fc_metrics_per_dataset = benchmark1.evaluate_simulator(fc_augmented_sim, batch_size=10000)
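The `batch_size` argument suggests that predictions are computed chunk by chunk. How LIPS uses it internally is not shown here; as an assumption-labeled sketch, batched inference over an array generally looks like this, with `predict_in_batches` a hypothetical helper:

```python
import numpy as np

def predict_in_batches(predict_fn, x, batch_size=10000):
    """Hypothetical helper: apply predict_fn to consecutive slices of x and stack the results."""
    x = np.asarray(x)
    if len(x) == 0:
        return predict_fn(x)  # nothing to split
    outputs = [predict_fn(x[start:start + batch_size]) for start in range(0, len(x), batch_size)]
    return np.concatenate(outputs, axis=0)

# Toy "model" that doubles its input: 25 samples with batch_size=10 give batches of 10, 10 and 5 rows
x = np.arange(25, dtype=float).reshape(25, 1)
y = predict_in_batches(lambda batch: 2.0 * batch, x, batch_size=10)
print(y.shape)
```

Larger batches usually amortise the per-call overhead of a neural network better, at the cost of memory.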

## Comparison of the two augmented simulators

### Machine learning metrics 

We can now compare the two augmented simulators. For example, to compare the MAPE90 (mean absolute percentage error computed on the last 10% quantile), together with the average NRMSE and MAE, on the test dataset (whose distribution is similar to the training distribution) for the currents (A) at both extremities of the power lines, we can print:

In [7]:
dataset_name = "test"
ML_metrics = 0
print("{:<10} : {}".format("MAPE90", dc_metrics_per_dataset[dataset_name][ML_metrics]["mape90"]))
print("{:<10} : {}".format("NRMSE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["NRMSE_avg"]))
print("{:<10} : {}".format("MAE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

MAPE90     : {'a_or': 0.16308450692718962, 'a_ex': 0.1464899336666497}
NRMSE_avg  : {'a_or': 0.08072826942694072, 'a_ex': 0.07632464035218221}
MAE_avg    : {'a_or': 84.77337833139086, 'a_ex': 107.36099085275112}


In [8]:
dataset_name = "test"
ML_metrics = 0
print("{:<10} : {}".format("MAPE90", fc_metrics_per_dataset[dataset_name][ML_metrics]["mape90"]))
print("{:<10} : {}".format("NRMSE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["NRMSE_avg"]))
print("{:<10} : {}".format("MAE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

MAPE90     : {'a_or': 0.007024117742886628, 'a_ex': 0.006871168089630504}
NRMSE_avg  : {'a_or': 0.0030290880240499973, 'a_ex': 0.002971998881548643}
MAE_avg    : {'a_or': 1.9451725482940674, 'a_ex': 2.78893780708313}
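The three printed metrics can be reproduced on toy arrays to make their meaning concrete. This is a sketch under stated assumptions, not the LIPS code: MAPE90 is taken as the MAPE restricted to entries whose absolute true value exceeds the 90th percentile (one reading of "last 10% quantile"), the `_avg` suffix is read as averaging over output columns (power lines), and NRMSE is normalised by the range of the true values in each column. All function names are hypothetical.

```python
import numpy as np

def mape90(y_true, y_pred, quantile=0.9):
    """MAPE on the entries whose |y_true| is above the given quantile (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    threshold = np.quantile(np.abs(y_true), quantile)
    mask = np.abs(y_true) > threshold
    return float(np.mean(np.abs(y_pred[mask] - y_true[mask]) / np.abs(y_true[mask])))

def mae_avg(y_true, y_pred):
    """Mean absolute error per column (e.g. per power line), averaged over columns."""
    err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return float(err.mean(axis=0).mean())

def nrmse_avg(y_true, y_pred):
    """RMSE per column divided by that column's range of true values, averaged (assumed normalisation)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2, axis=0))
    value_range = y_true.max(axis=0) - y_true.min(axis=0)
    return float(np.mean(rmse / value_range))

# Toy example: one "line" with true currents 1..10 A and predictions that are 10% too high
y_true = np.arange(1.0, 11.0).reshape(-1, 1)
y_pred = 1.1 * y_true
print(mape90(y_true, y_pred), mae_avg(y_true, y_pred), nrmse_avg(y_true, y_pred))
```

MAE is expressed in the unit of the variable (amperes here), while MAPE90 and NRMSE are unitless, which is why the MAE values printed above are much larger numbers than the other two metrics.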


To get the same quantities on the out-of-distribution dataset (distribution shift due to topology changes), we can look at:

In [9]:
dataset_name = "test_ood_topo"
print("{:<10} : {}".format("MAPE90", dc_metrics_per_dataset[dataset_name][ML_metrics]["mape90"]))
print("{:<10} : {}".format("NRMSE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["NRMSE_avg"]))
print("{:<10} : {}".format("MAE_avg", dc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

MAPE90     : {'a_or': 0.17918059033390996, 'a_ex': 0.16180186468334115}
NRMSE_avg  : {'a_or': 0.060954326726434374, 'a_ex': 0.057830270537688924}
MAE_avg    : {'a_or': 90.20692764002857, 'a_ex': 114.34817311476654}


In [10]:
dataset_name = "test_ood_topo"
print("{:<10} : {}".format("MAPE90", fc_metrics_per_dataset[dataset_name][ML_metrics]["mape90"]))
print("{:<10} : {}".format("NRMSE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["NRMSE_avg"]))
print("{:<10} : {}".format("MAE_avg", fc_metrics_per_dataset[dataset_name][ML_metrics]["MAE_avg"]))

MAPE90     : {'a_or': 0.19454041454560483, 'a_ex': 0.1942612925586953}
NRMSE_avg  : {'a_or': 0.053016722202301025, 'a_ex': 0.05295873433351517}
MAE_avg    : {'a_or': 40.29703903198242, 'a_ex': 55.750465393066406}


### Physics compliance

In [11]:
physic_compliances = 1
current_violation = fc_metrics_per_dataset["test"][physic_compliances]["BasicVerifications"]["currents"]["a_or"]["Violation_proportion"]
print("{:.2f}% of currents at the origin side of power lines violate the current positivity.".format(current_violation*100))

2.67% of currents at the origin side of power lines violate the current positivity.


In [12]:
current_error = fc_metrics_per_dataset["test"][physic_compliances]["BasicVerifications"]["currents"]["a_or"]["Error"]
print("The sum of negative current values (Amp) : {:.2f}".format(current_error))

The sum of negative current values (Amp) : 8368.78
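The two quantities printed above can be sketched as follows, assuming `Violation_proportion` is the fraction of predicted `a_or` values that are negative and `Error` is the sum of their absolute values in amperes (consistent with the positive value printed above). `current_positivity` is a hypothetical name, not the LIPS function.

```python
import numpy as np

def current_positivity(a_or):
    """Hypothetical check: share of negative predicted currents and their total magnitude (A)."""
    currents = np.asarray(a_or, dtype=float)
    negative = currents < 0
    return {
        "Violation_proportion": float(negative.mean()),
        "Error": float(np.abs(currents[negative]).sum()),
    }

# Toy predictions for 2 samples x 2 lines: two of the four currents are negative
report = current_positivity([[10.0, -2.0], [-3.0, 5.0]])
print(report)
```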


### Industrial readiness

In [13]:
fc_augmented_sim.predict_time

0.06155872344970703

In [14]:
dc_sim._predict_time

122.04796147346497

In [15]:
dc_sim._raw_grid_simulator.comp_time

70.2608094215393
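These timings can be turned into speed-up factors. The sketch below assumes all three values are wall-clock times in seconds measured over the same test data, which the cells above do not state explicitly; `speedup` is a hypothetical helper and the numbers are copied from the outputs above.

```python
fc_predict_time = 0.06155872344970703  # fc_augmented_sim.predict_time
dc_predict_time = 122.04796147346497   # dc_sim._predict_time
dc_solver_time = 70.2608094215393      # dc_sim._raw_grid_simulator.comp_time

def speedup(reference_time, candidate_time):
    """How many times faster the candidate is than the reference."""
    return reference_time / candidate_time

print(f"FC vs full DC prediction : {speedup(dc_predict_time, fc_predict_time):.0f}x")
print(f"FC vs DC solver time only: {speedup(dc_solver_time, fc_predict_time):.0f}x")
```

Under that assumption the fully connected network is roughly three orders of magnitude faster, and the gap between `_predict_time` and `comp_time` suggests that part of the DC approximation's time is spent outside the solver itself.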

# Benchmark2

In [24]:
from lips.benchmark import PowerGridBenchmark
path_benchmark = os.path.join("reference_data")
log_path = os.path.abspath(os.path.join("lips","logger","logs.log"))
benchmark2 = PowerGridBenchmark(benchmark_name="Benchmark2",
                                benchmark_path=path_benchmark,
                                load_data_set=True,
                                log_path=log_path
                               )

# Benchmark3 