## Script to generate data for experiment2.jl
The original script was quite confused (and confusing) about the number of tasks, batches and minibatches. It created 192 tasks split up into 24 batches. This calculation comes from an input of "2^5 (32) tasks" multiplied by 25 (actually 24) batches, where a batch is an arbitrary way to split the training dataset into separate files. In this implementation, the training dataset is instead written to a single HDF5 file of 192 tasks, so that minibatching can be handled when training the model.
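
One way to form minibatches at training time is to partition the task indices lazily. The following is a minimal sketch, assuming 192 saved tasks and a minibatch size of 4; the names `n_tasks` and `minibatch_size` are illustrative and do not come from the original script.

```julia
# Illustrative values: 192 tasks as described above; the minibatch size is an assumption
n_tasks = 192
minibatch_size = 4

# Iterators.partition lazily splits the index range into consecutive chunks
minibatches = collect(Iterators.partition(1:n_tasks, minibatch_size))

println("Number of minibatches: ", length(minibatches))
println("Indices in the first minibatch: ", first(minibatches))
```

Because the split happens on indices only, the same file can be re-partitioned with a different minibatch size without regenerating any data.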

In [1]:
# Run the script in parallel
using Distributed

# Remove any existing worker processes, then add fresh ones
rmprocs(workers())
n_workers = 8 # Change this to the number of cores you want to use
addprocs(n_workers)

[33m[1m└ [22m[39m[90m@ Distributed /Users/julia/buildbot/worker/package_macos64/build/usr/share/julia/stdlib/v1.6/Distributed/src/cluster.jl:1041[39m


8-element Vector{Int64}:
 2
 3
 4
 5
 6
 7
 8
 9

In [2]:
@everywhere begin
    using Pkg
    Pkg.activate(".")
    Pkg.instantiate()
    #Pkg.status()
end

[32m[1m  Activating[22m[39m environment at `~/github/DifferentiableUserModels-JT/Project.toml`


      From worker 7:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 8:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 4:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 9:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 3:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 2:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`
      From worker 5:	  Activating environment at `~/github/DifferentiableUserModels-JT/Project.toml`


In [16]:
@everywhere begin
    using ArgParse
    using BSON
    using Distributions
    using Flux
    using Stheno
    using Tracker
    using Printf
    using HDF5
    using SharedArrays
end

In [4]:
@everywhere include(joinpath(@__DIR__, "NeuralProcesses.jl/src/NeuralProcesses.jl"))

[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m
[33m[1m└ [22m[39m[90m@ Base.Docs docs/Docs.jl:240[39m


In [5]:
@everywhere using .NeuralProcesses

In [8]:
# Make a dictionary to just use the default arguments from the argument parser
# Not all of these are used in this script
@everywhere begin
    function get_default_args()
        defaults = Dict(
            "gen" => "menu_search",
            "n_traj" => 0,
            "n_epochs" => 50,
            "n_batches" => 25,
            "batch_size" => 4,
            "params" => false,
            "p_bias" => 0.0,
            "bson" => "",
            "epsilon" => 0.0
        )
        return defaults
    end
    
    args = get_default_args()
end

In [9]:
# Don't bother initializing the model
# println("Initializing model...")

# model = anp_ex2(
#     dim_embedding=128,
#     num_encoder_heads=8,
#     num_encoder_layers=6,
#     num_decoder_layers=6,
#     args=args
# ) |> gpu


In [10]:
# Don't bother initializing the loss
# println("Initializing loss...")

# loss(xs...) = np_elbo(
#     xs...,
#     num_samples=5,
#     fixed_σ_epochs=3
# )

In [11]:
# Make the data generator
@everywhere begin
    println("Initializing data generator")
    
    batch_size  = args["batch_size"]
    
    # Redundant for this sampler, but required to fit the DataGenerator definition
    x_context = Distributions.Uniform(-2, 2)
    x_target  = Distributions.Uniform(-2, 2)
    
    num_context = Distributions.DiscreteUniform(10, 10)
    num_target  = Distributions.DiscreteUniform(10, 10)
    
    data_gen = NeuralProcesses.DataGenerator(
                    SearchEnvSampler(args;),
                    batch_size=batch_size,
                    x_context=x_context,
                    x_target=x_target,
                    num_context=num_context,
                    num_target=num_target,
                    σ²=1e-8
                )
    println("Data gen initialized")
end

Initializing data generator
Data gen initialized
      From worker 4:	Initializing data generator
      From worker 4:	Data gen initialized
      From worker 3:	Initializing data generator
      From worker 9:	Initializing data generator
      From worker 9:	Data gen initialized
      From worker 5:	Initializing data generator
      From worker 7:	Initializing data generator
      From worker 8:	Initializing data generator
      From worker 6:	Initializing data generator
      From worker 2:	Initializing data generator
      From worker 5:	Data gen initialized
      From worker 6:	Data gen initialized
      From worker 8:	Data gen initialized
      From worker 3:	Data gen initialized
      From worker 7:	Data gen initialized
      From worker 2:	Data gen initialized


In [56]:
# Generate the data in parallel. The vector "data" will hold the dataset from
# all tasks_per_epoch tasks

tasks_per_epoch = 16

println("Starting generating data with $n_workers workers")
data = @distributed (vcat) for task_n in 1:tasks_per_epoch
    println("Starting task $task_n")
    flush(stdout) # Flush so output is printed in real time in Jupyter notebooks
    
    # Generate data
    task_data = gen_batch(data_gen, 1; eval=false)
    #task_data = gen_batch(data_gen, tasks_per_worker; eval=false)
    
    # Return the data from this task; vcat merges it into the "big" data array above
    task_data
end

println("Finished generating data")

Starting generating data with 8 workers
      From worker 5:	Starting task 7
      From worker 3:	Starting task 3
      From worker 2:	Starting task 1
      From worker 9:	Starting task 15
      From worker 8:	Starting task 13
      From worker 7:	Starting task 11
      From worker 6:	Starting task 9
      From worker 4:	Starting task 5
      From worker 2:	Starting task 2
      From worker 9:	Starting task 16
      From worker 5:	Starting task 8
      From worker 7:	Starting task 12
      From worker 4:	Starting task 6
      From worker 3:	Starting task 4
      From worker 8:	Starting task 14
      From worker 6:	Starting task 10
Finished generating data
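
The `(vcat)` reducer is what combines the value returned by each loop iteration into a single vector on the calling process. Below is a toy sketch using only the `Distributed` standard library, with squared integers standing in for the output of `gen_batch` (an assumption made purely for illustration).

```julia
using Distributed

# Each iteration returns a one-element vector; vcat concatenates them into one vector.
# With no worker processes added, the loop simply runs on the main process.
results = @distributed (vcat) for i in 1:4
    [i^2]   # stand-in for the data produced by one task
end

println(sort(results))
```

Assigning the result of `@distributed` with a reducer blocks until all iterations have finished, which is why "Finished generating data" is printed only after every task has started and completed.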


In [57]:
# Add multiple pieces of metadata to the dataset
metadata = Dict(
    "gen_type" => "SearchEnvSampler / menu_search",
    "tasks_per_epoch" => tasks_per_epoch,
    "eval" => false,
    "batch_size" => batch_size,
    "n_traj" => "random(1-8), although this doesn't seem to be used", # This is what happens when n_traj is set to 0 in the args dictionary
    "noise_variance" => 1e-8,
    "p_bias" => 0.0
)

# Function to save the data as HDF5
function create_hdf5_ex2(data, filename, metadata)
    # Open the HDF5 file for writing, overwriting if it exists
    h5open(filename, "w") do fid
        # Loop over the data vector
        for (i, d) in enumerate(data)
            # Create a group for each mini-batch
            g = create_group(fid, "task_$i")

            # Add datasets to the group
            g["xc"] = d[1]
            g["yc"] = d[2]
            g["xt"] = d[3]
            g["yt"] = d[4]

            # Add metadata to the group
            for (key, value) in metadata
                write_attribute(g, key, value)
            end
        end
    end
end


create_hdf5_ex2 (generic function with 1 method)

In [58]:
# Save the data!

filepath = "data/ex2/experiment2_data.hdf"
create_hdf5_ex2(data,filepath,metadata)
println("File saved successfully")

File saved successfully
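
To check a saved file, the groups can be read back with HDF5.jl. This is a minimal sketch, assuming the `task_$i` group layout written by `create_hdf5_ex2`; the helper name `read_hdf5_ex2` and the small temporary file are hypothetical and serve only to keep the example self-contained.

```julia
using HDF5

# Hypothetical helper (not part of the original notebook): read every "task_i"
# group back into a vector of named tuples, including one metadata attribute.
function read_hdf5_ex2(filename)
    h5open(filename, "r") do fid
        n = length(keys(fid))            # number of task groups in the file
        map(1:n) do i
            g = fid["task_$i"]
            (xc = read(g["xc"]), yc = read(g["yc"]),
             xt = read(g["xt"]), yt = read(g["yt"]),
             batch_size = read_attribute(g, "batch_size"))
        end
    end
end

# Write a tiny file with the same layout so the sketch runs on its own
tmp = tempname() * ".hdf"
h5open(tmp, "w") do fid
    g = create_group(fid, "task_1")
    g["xc"] = rand(2, 3)
    g["yc"] = rand(2, 3)
    g["xt"] = rand(2, 3)
    g["yt"] = rand(2, 3)
    write_attribute(g, "batch_size", 4)
end

tasks = read_hdf5_ex2(tmp)
println("Read ", length(tasks), " task(s); size of xc: ", size(tasks[1].xc))
```

Looking groups up by `"task_$i"` rather than iterating over `keys(fid)` keeps the tasks in numeric order, since the key names would otherwise sort as strings (`task_10` before `task_2`).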


In [None]:
# Don't bother training the model
# println("Proceeding to training loop...")

# mkpath("models/"*string(args["bson"]))

# train_model!(
#     model,
#     loss,
#     data_gen,
#     ADAM(5e-4),
#     bson=args["bson"],
#     experiment=args["gen"],
#     starting_epoch=0,
#     tasks_per_epoch=2^5,
#     batches=args["n_batches"],
#     total_epochs=args["n_epochs"],
#     epsilon=args["epsilon"]
# )