# Checking and describing the generated data

It is always a good idea to add a notebook that takes a quick look at the data. It helps you remember which data you collected and lets you check whether it actually looks correct.

In [38]:
from algbench import describe, read_as_pandas, Benchmark
from _conf import EXPERIMENT_DATA

In [39]:
describe(EXPERIMENT_DATA)

An entry in the database can look like this:
_____________________________________________
 result:
| num_nodes: 25
| lower_bound: 87238790.0
| objective: 87238790.0
 timestamp: 2023-11-15T18:15:30.526068
 runtime: 0.10997509956359863
 stdout: [[0.00709986686706543, 'Set parameter Username\n'], [0.012656927108764648, 'A...
 stderr: []
 logging: []
 env_fingerprint: 945241d9e68ff38bdacc4f4144ddffee9066bc03
 args_fingerprint: 00e90df2fbbd823b698220e9ecb0834d856a16cf
 parameters:
| func: run_solver
| args:
|| instance_name: random_euclidean_25_0
|| time_limit: 90
|| strategy: GurobiTspSolver
 argv: ['01_run_benchmark.py']
 env:
| hostname: pool-4-147.ibr.cs.tu-bs.de
| python_version: 3.10.8 (main, Nov 24 2022, 08:09:04) [Clang 14.0.6 ]
| python: /Users/krupke/Library/anaconda3/envs/mo310/bin/python3
| cwd: /Users/krupke/Repositories/cpsat-primer/examples/tsp_evaluation
| 
        # Unfortunately deprecated.
        "environment": [
            {
                "name": str(pkg.project_nam

In [40]:
t = read_as_pandas(
    EXPERIMENT_DATA,
    lambda entry: {
        "instance_name": entry["parameters"]["args"]["instance_name"],
        "num_nodes": entry["result"]["num_nodes"],
        "time_limit": entry["parameters"]["args"]["time_limit"],
        "strategy": entry["parameters"]["args"]["strategy"],
        "runtime": entry["runtime"],
        "objective": entry["result"]["objective"],
        "lower_bound": entry["result"]["lower_bound"],
    },
)
t["opt_gap"] = (t["objective"] - t["lower_bound"]) / t["lower_bound"]
t

| | instance_name | num_nodes | time_limit | strategy | runtime | objective | lower_bound | opt_gap |
|---|---|---|---|---|---|---|---|---|
| 0 | random_euclidean_25_0 | 25 | 90 | GurobiTspSolver | 0.109975 | 8.723879e+07 | 8.723879e+07 | 0.000000 |
| 1 | random_euclidean_25_0 | 25 | 90 | CpSatTspSolverV1 | 0.174499 | 8.723879e+07 | 8.723879e+07 | 0.000000 |
| 2 | random_euclidean_25_1 | 25 | 90 | GurobiTspSolver | 0.030855 | 1.115998e+08 | 1.115998e+08 | 0.000000 |
| 3 | random_euclidean_25_1 | 25 | 90 | CpSatTspSolverV1 | 0.145274 | 1.115998e+08 | 1.115998e+08 | 0.000000 |
| 4 | random_euclidean_25_2 | 25 | 90 | GurobiTspSolver | 0.036049 | 9.958995e+07 | 9.958995e+07 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 102 | random_euclidean_200_1 | 200 | 90 | GurobiTspSolver | 21.167487 | 7.533585e+07 | 7.533585e+07 | 0.000000 |
| 103 | random_euclidean_200_1 | 200 | 90 | CpSatTspSolverV1 | 93.967414 | 9.746496e+07 | 7.377529e+07 | 0.321106 |
| 104 | random_euclidean_200_2 | 200 | 90 | GurobiTspSolver | 10.534419 | 7.404569e+07 | 7.404569e+07 | 0.000000 |
| 105 | random_euclidean_200_2 | 200 | 90 | CpSatTspSolverV1 | 94.380089 | 8.205347e+07 | 7.378950e+07 | 0.111994 |
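
A table like this is easier to review once it is condensed. The following is a minimal sketch of how that could look, not part of the original notebook. It uses a small made-up stand-in `t_demo` with the same column names as `t`; the instance names and numbers are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for the table `t` built above; all values are made up.
t_demo = pd.DataFrame(
    {
        "instance_name": ["a_25_0", "a_25_0", "a_200_0", "a_200_0"],
        "num_nodes": [25, 25, 200, 200],
        "strategy": ["GurobiTspSolver", "CpSatTspSolverV1"] * 2,
        "runtime": [0.1, 0.2, 21.2, 94.0],
        "opt_gap": [0.0, 0.0, 0.0, 0.32],
    }
)

# How many runs exist per instance size and strategy?
# Missing or duplicated runs show up as unexpected counts.
runs = t_demo.groupby(["num_nodes", "strategy"]).size().unstack(fill_value=0)
print(runs)

# Coarse per-strategy statistics as a first plausibility check.
summary = t_demo.groupby("strategy")[["runtime", "opt_gap"]].agg(["mean", "max"])
print(summary)
```

On the real data, the same two calls would be applied to `t` instead of `t_demo`.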


## Check for errors in the data

Always check that the results you obtained are actually feasible. Errors happen easily and are not always visible in the plots.
Thus, you want to run some basic checks to detect errors early on. For example, you could have accidentally swapped the lower and upper bounds in the data generation process.
Depending on your plots, this may go unnoticed, and you may end up comparing the wrong data and drawing the wrong conclusions.
Or you could have accidentally swapped the runtime and objective values, which can still look plausible in the data because both the runtime and the objective often increase with the instance size.
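
A swapped runtime column can often be caught by comparing the runtime against the time limit. The sketch below illustrates the idea on made-up rows. The name `t_demo` and the 10 % `slack` factor are assumptions chosen for this example, not values from the original notebook; pick a slack that fits how far your solvers may overshoot the limit.

```python
import pandas as pd

# Made-up rows; in the second one, runtime and objective are swapped.
t_demo = pd.DataFrame(
    {
        "instance_name": ["a_25_0", "a_25_1"],
        "time_limit": [90, 90],
        "runtime": [0.11, 1.1e8],
        "objective": [8.7e7, 0.03],
    }
)

# Assumption: a run may overshoot its time limit by at most 10 %.
slack = 1.1
too_slow = t_demo[t_demo["runtime"] > slack * t_demo["time_limit"]]
suspicious = too_slow["instance_name"].to_list()
print("Runs far beyond the time limit:", suspicious)
```

A non-empty list does not prove that the columns were swapped, but it tells you which entries to inspect by hand.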

A very basic check is to verify that the best lower and upper bounds do not contradict each other. This catches many errors. However, you usually need some tolerance to account for numerical errors.

In [41]:
assert (t.dropna()["opt_gap"] >= -0.0001).all(), "Optimality gap is negative!"

In [42]:
# Always make sure that your results are not trivially wrong
#  - e.g. lower bound is higher than objective
max_lb = t.groupby(["instance_name"])["lower_bound"].max()
min_obj = t.groupby(["instance_name"])["objective"].min()
eps = 0.0001  # Some tolerance is needed when working with floats.
bad_instances = max_lb[max_lb - min_obj > eps * max_lb].index.to_list()
assert len(bad_instances) == 0, f"Bad instances detected: {bad_instances}"
# t[t["instance_name"].isin(bad_instances)]
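
To see why comparing bounds across solvers catches more than the per-row gap check, here is a small self-contained sketch on made-up data. The helper name `find_contradicting_instances` is hypothetical, and the sketch assumes a minimization problem, so that every lower bound must stay below every objective value reported for the same instance.

```python
import pandas as pd


def find_contradicting_instances(df, eps=1e-4):
    """Return instances whose best lower bound exceeds the best objective."""
    max_lb = df.groupby("instance_name")["lower_bound"].max()
    min_obj = df.groupby("instance_name")["objective"].min()
    return max_lb[max_lb - min_obj > eps * max_lb].index.to_list()


# Made-up results of two solvers, A and B, on two instances.
demo = pd.DataFrame(
    {
        "instance_name": ["x", "x", "y", "y"],
        "strategy": ["A", "B", "A", "B"],
        "objective": [100.0, 100.0, 50.0, 60.0],
        "lower_bound": [100.0, 90.0, 45.0, 55.0],
    }
)

# Every single row looks fine: no row has a negative optimality gap.
row_gap = (demo["objective"] - demo["lower_bound"]) / demo["lower_bound"]
print((row_gap >= -1e-4).all())

# Across solvers, however, B's lower bound on "y" (55) exceeds A's objective (50).
bad = find_contradicting_instances(demo)
print(bad)
```

In this toy data, at least one of the two solvers must be wrong on instance `y`, even though each row is consistent on its own.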