# PSB2 Benchmark set

The full benchmark set can be downloaded from [this website](https://zenodo.org/records/5084812). For each problem, there are two sets of input-output examples: `edge` is a set of edge cases (roughly 10-40 instances) and `random` is a set of one million instances. Due to space limitations, our `data.jl` files only contain the edge set for each problem. With this notebook, you can also convert the instances from the `random` sets into benchmark problems for use in Herb.jl.

First, we download the full benchmark set from the website using its Python wrapper. This requires the Python package `psb2`, which is only available through pip and not through Conda itself.

In [None]:
using Conda, PyCall
using HerbData, HerbBenchmarks

Conda.pip_interop(true)
Conda.pip("install", "psb2")

psb2 = PyCall.pyimport("psb2")

Next, we create a function that writes a data file for each problem in the benchmark. It downloads the JSON files into the `datasets` subfolder, which is ignored. For each problem, the `random` set contains one million examples.

Getting the full `random` set for one problem takes about five minutes, and the benchmark contains 25 problems in total. Getting the `edge` sets for all benchmark problems takes about five minutes in total.
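As a rough back-of-the-envelope estimate, the figures above can be combined to gauge how long fetching every `random` set would take. The sketch below assumes the problems are downloaded one after another and that each takes about the same time; the variable names are illustrative only.

```julia
# Rough estimate only: assumes sequential downloads and equal time per problem.
minutes_per_problem = 5   # approximate time for one full `random` set
n_problems = 25           # number of problems in the benchmark

total_minutes = minutes_per_problem * n_problems
total_hours = round(total_minutes / 60, digits=1)

println("Fetching all random sets: about $(total_minutes) minutes (~$(total_hours) hours)")
```

Actual times will depend on the network connection, so treat this as an order-of-magnitude figure.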

In [None]:
function write_psb2_problems_to_file(problems::Vector{String}=String["fizz-buzz"], edge_or_random="random", n_train=200, n_test=2000, format="psb2")
    # If an empty list is passed, use all problems in the benchmark
    if isempty(problems) 
        problems = psb2.PROBLEMS
    end
    for name in problems
        if !(name in psb2.PROBLEMS)
            throw(ArgumentError("$(name) does not exist in the psb2 problems"))
        else
            # This loads the json files to the /datasets/<name> folder
            psb2.fetch_examples(pwd(), name, n_train, n_test, format)
            julia_name = replace(name, "-" => "_")
            # Reset the file if it exists, so we can append the data all at once
            if isfile("$(pwd())/datasets/$(name)/$(julia_name)data.jl")
                rm("$(pwd())/datasets/$(name)/$(julia_name)data.jl")
            end
            parse_to_julia("$(pwd())/datasets/$(name)/", "$(name)-$(edge_or_random).json", PSB2_2021.parse_line_json, julia_name, "a")
        end
    end
end
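The naming conventions used in the function above can be illustrated in isolation. The following is a minimal sketch: `psb2_paths` is a hypothetical helper, not part of HerbBenchmarks or `psb2`, and the check on `edge_or_random` is an added assumption that only the two sets described earlier are valid. It derives the Julia-side name and the file locations from a problem name without downloading anything.

```julia
# Hypothetical helper (not part of HerbBenchmarks): collects the names and
# paths that the function above builds from a PSB2 problem name.
function psb2_paths(root::AbstractString, name::AbstractString, edge_or_random::AbstractString)
    # Assumption: only the two example sets described earlier are accepted.
    edge_or_random in ("edge", "random") ||
        throw(ArgumentError("expected \"edge\" or \"random\", got \"$(edge_or_random)\""))

    # Hyphens in the problem name become underscores in the Julia-side name.
    julia_name = replace(name, "-" => "_")
    dir = joinpath(root, "datasets", name)

    return (
        julia_name = julia_name,
        json_file = joinpath(dir, "$(name)-$(edge_or_random).json"),
        data_file = joinpath(dir, "$(julia_name)data.jl"),
    )
end

paths = psb2_paths(pwd(), "fizz-buzz", "edge")
println(paths.julia_name)
println(paths.json_file)
println(paths.data_file)
```

Using `joinpath` instead of string interpolation keeps the paths portable across operating systems, and rejecting an unknown set name early avoids a less obvious file-not-found error later on.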

In [None]:
write_psb2_problems_to_file()