Tooling for a make.jl file in the project root #315

goerz · 2022-01-10T23:46:41Z

As explained in https://discourse.julialang.org/t/julia-based-makefile-replacement-for-research-workflows/74368, it might be nice to have some tooling that would allow me to write a script make.jl in the root of my research project that would automatically run the scripts to generate any missing output files or plots.

I realize this might a stretch, but it wouldn't seem entirely out-of-scope for a project like DrWatson to include something that takes a Dict defining a map of target files to dependencies and runs some code to produce those targets automatically.

The text was updated successfully, but these errors were encountered:

Datseris · 2022-01-11T10:29:44Z

This seems like it boils down to having a bunch of .jl files that use produce_or_load, and then a make.jl file what just does incluide( all files ). My question is, what do you do if for the same .jl file you'd want to run this with different parameters?

goerz · 2022-01-11T12:05:47Z

I'm not sure we're exactly on the same page :-)

What I'm looking for is a way to automatically run the .jl files. Like a make.jl file that contains

RULES = [
    ScriptRule(
        [datadir("oct", savename("baseline_oct_result", Dict(:iter_stop=>100, :ν_max=>0.0), "jld2"))],
        scriptsdir("2022-01-09_baseline.jl")
    )
]

where the 2022-01-09_baseline.jl script would be run automatically run if the jld2 doesn't exist. The script itself in this case still has produce_or_load, but that only takes care of not unnecessarily re-creating the jld2 file if it already exists. It's somewhat redundant if someone is using the make.jl, but still important if they call the script directly.

The above snippet is currently in https://github.com/goerz-research/2022-01_Rydberg_Krotov_Spectral_Constraints, see also https://discourse.julialang.org/t/julia-based-makefile-replacement-for-research-workflows/74368/4?u=goerz

That system could actually specifically handle the situation of running the same .jl file with different arguments that you mention: You could have, e.g.,

RULES = [
    ScriptRule(
        [datadir("oct", savename("baseline_oct_result", Dict(:iter_stop=>100, :ν_max=>0.1), "jld2"))],
        scriptsdir("2022-01-09_baseline.jl"),
        args=[0.1]
    ), 
    ScriptRule(
        [datadir("oct", savename("baseline_oct_result", Dict(:iter_stop=>100, :ν_max=>0.5), "jld2"))],
        scriptsdir("2022-01-09_baseline.jl"),
        args=[0.5]
    )
]

under the assumption that the script takes the value for ν_max as an argument (I haven't actually implemented a ScriptRule with support for args, but it would be easy to do)

Datseris · 2022-01-16T08:08:01Z

Okay, but that ScriptRule you describe seems an okay solution? What's the problem with implementing that? Yes, you would have to manually write down every single ScriptRule for your project, but I don't see another way for the program to deduce this information.

If the ScriptRule works we can add here via a PR, it fits to the package and doesn't bring any dependencies nor disturbs the workflow.

What is a bit annoying is that you have to manually write down all files like datadir("oct", savename("baseline_oct_result", Dict(:iter_stop=>100, :ν_max=>0.5), "jld2")). Would be nice if we could automate this.

goerz · 2022-01-16T15:50:15Z

Yeah, there’s no problem :-)

The Discourse thread was just inquiring whether someone had already implemented a system like that (it seems not).

This issue is to find out if there’s any interest in having make-like features built in to DrWatson (it seems like “maybe”). So what I’ll do is let my makerules.jl gestate for a while through a few different projects, and then think of making a pull request to get that functionality into DrWatson

goerz · 2022-01-16T16:22:59Z

As for having to declare the output/target files: I agree, especially because it requires duplicating the call to savename and thus declaring all the parameters again in the "makefile". It would be much nicer if the output files could be automatically extracted from the script, so that they're only declared in one place. I just haven't come up with a good way to achieve that.

goerz · 2022-01-16T17:41:06Z

The problem is that the output files generally depend on the arguments, so figuring out the output filenames requires running the code at some level. The only reasonable idea I've come up with would require produce_or_load/@produce_or_load to have a dry_run option, so that

produce_or_load([path="",] config, f; dry_run=true, kwargs...) -> exists, s

where exists is a boolean indicating whether the file already exists or not, and s is the full path to the output file, as before. With that, scripts could self-report their output file(s): They'd have to be run once in dry-run-mode when instantiating the rule, and then once more for the actual run.

The first output, exists, is actually arbitrary for this purpose: I'd only need s. An existence-indicator seemed like the most straightforward thing (apart from nothing/an empty dict), but maybe there's something else that is cheap to calculate and carries more information for other potential use cases.

P.S.: A small nitpick about the documentation. Maybe instead of

produce_or_load([path="",] config, f; kwargs...) -> file, s

it would be clearer to have

produce_or_load([path="",] config, f; kwargs...) -> data, filename

I've gotten confused by file and s every time I've looked at that :-)

goerz · 2022-01-16T17:50:41Z

I'd note that dry_run would be very similar to the existing loadfile, but unlike loadfile, it doesn't produce anything on disk

Datseris · 2022-01-16T19:12:08Z

The problem is that the output files generally depend on the arguments, so figuring out the output filenames requires running the code at some level.

I was thinking of something much, much simpler. Julia has a folder watching functionality (that's how Revise.jl works actually). You can simply use the folder watching package, and tell it "watch these folders. Whenever a file is added in these folders, add its path and name to the X container". X will be a container that can be read by our make file business.

Difficulties:

How to ensure this folder watching works in the backfground all the time
how to ensure than when you save an output file to X, it also knows what's the script it produced it?

Perhaps we can kill two birds with one stone by adding this functionality to the macros @tagsave and @produce_or_load. Because they are macros, they have perfect knowledge of where they are being used (in fact, that's how the script entry of @tagsave! works). So we can add an option to the macros "save both the file name and the script used to run it as a pair script => filename to the X thingy"

Actually, this should work, and seems to be generic enough for your make idea? It also doesn't need folder watching functionality, the macros @produce_or_load do everything by themselves.

Datseris · 2022-01-16T19:13:40Z

P.S.: A small nitpick about the documentation. Maybe instead of

Please consider doing a pull request that fixes this, seems a rather straightforward documentation improvement!

Datseris · 2022-01-16T19:14:38Z

The X thingy can be a jld2 data file which can be read by a function generate_makefile that would read the pairs "script" => "filename" and generate the make file you wish for.

goerz · 2022-01-16T21:21:04Z

Please consider doing a pull request that fixes this, seems a rather straightforward documentation improvement!

Done in #318

I was thinking of something much, much simpler. Julia has a folder watching functionality
[...]
So we can add an option to the macros "save both the file name and the script used to run it as a pair script => filename to the X thingy

Yeah, I don't think folder watching would even be required if you hook something into the existing macros

Actually, this should work, and seems to be generic enough for your make idea?

I'm not sure... it would have to record the command line arguments (ARGS) as well.

For clarity, let me just re-iterate what I'm trying to achieve:

So, let's say I have a DrWatson-based project on Github, which I want people to be able to clone and reproduce. It contains scripts in the scripts subfolder, which produce jld2 files in the data subfolder. Generally, these jld2 files would not be committed to the repository (except maybe if they take more than a couple of hours to produce). Thus, the person trying to reproduce the project will have to run the appropriate script files, possibly with different arguments (!), and possibly in a specific order (if one script uses the output jld2 of another script as input).

That's the part I'm trying to automate: Ideally, the person reproducing the project can simply run make (or julia make.jl). Right now, I'm working with a make.jl that manually specifies which output jld2 files should be generated from which scripts/arguments. The next step I was thinking of would be to have a dry_run option for each script that returns the list of filenames it will generate for a particular set of ARGS, so that the make.jl only needs the list of (script, arguments) and can figure out the names of the jld2 files automatically.

What you're proposing, I think, is to hook into tagsave to store a mapping jld2_file => (script, arguments) in some central location (another jld2 file) that would be committed to the repo and used by make.jl. Actually, I think you may have considered only jld2_file => script which would be easy but not generic enough for a make. The general concept would work, though, at least in principle. I don't think it would require file watching: just every time tagsave produces an output file that's not already in the database, it has to append a record.

I'd see two difficulties, though.

The first is with recording the arguments (ARGS). Specifically, the ARGS are not necessarily the same as the config dict (oct_args) passed to produce_or_run/tagsave. I'm not sure to what extent the macro has access to the arguments. I guess ARGS is a global variable, but if I run the scripts by include instead of from the command line, the way my main functions are currently written, I'm actually not using ARGS but a passed variable args. 🤷 So I guess @tagsave would need an extra args argument to override ARGS as well, which would seem like a bit of a pain.

The second problem is that I might want to play around with parameters and run scripts in the "exploratory" stage of the project, without wanting to record the result for the make. So there'd have to be a way to skip recording, or to clean up the database manually (maybe a human-readable format would be better than jld2 in that case). Another possibility would be to generally record ARGS/args in all output jld2 files generated by @tagsave. There could then be a manual command that looks at all current jld2 files in the project and records their name => (script, args) mapping in the database (under the assumption that I've removed any "exploratory" `jld2 files). Keeping the ordering might be tricky with that approach: I guess for any new files, they should be ordered by modification time, but even that isn't guaranteed to work.

So it seems like it's possible, but not easier than having a dry_run option. Unless I misunderstood what you had in mind :-)

I'd also say that dry_run doesn't seem like a terrible feature on its own, independent of its use in a possible make-like system. Do you think I should make a PR just for that?

Datseris · 2022-01-17T09:14:30Z

What you're proposing, I think, is to hook into tagsave to store a mapping jld2_file => (script, arguments) in some central location (another jld2 file) that would be committed to the repo and used by make.jl

Yes.

The first is with recording the arguments (ARGS). Specifically, the ARGS are not necessarily the same as the config dict (oct_args) passed to produce_or_run/tagsave.

Well, that's kinda your problem. You chose to work with ARGS instead of the arguments passed to the config dictionary of produce_or_load. We shouldn't try to tailor DrWatson to such a specific case. The general case of capturing the arguments passed to produced_or_load is more than good enough and should be the way to go. In the end of the day, just put whatever ARGS contains into the config you pass to DrWatson from your own end, and all problems solved.

The second problem is that I might want to play around with parameters and run scripts in the "exploratory" stage of the project, without wanting to record the result for the make. So there'd have to be a way to skip recording, or to clean up the database manually

no, just put the keyword record _to_makefile = false.

I'd also say that dry_run doesn't seem like a terrible feature on its own, independent of its use in a possible make-like system. Do you think I should make a PR just for that?

An overwhelming amount of options and keywords will swarm the ones that are truly useful, so we should be adding as much as possible only keywords that are genuinely useful.

goerz · 2022-01-18T06:08:22Z

Fair enough.

I don't think the idea of recording in some central database the output files each script produces is going to work for me, though. So I won't pursue that further.

We'll have to agree to disagree with ARGS vs the config dictionary 😉. I don't see myself getting to a point where the config reflects the ARGS one-to-one, but who knows, maybe my thinking on this will evolve.

I'll figure out some way around this. What I might do is have a convention for my scripts that they (optionally) define a targets function in addition to a main function. The make.jl imports the script file, and runs ProgMod.targets(ARGS) to get a list of output filenames. After that, it can run ProgMod.main(ARGS) to actually produce those files.

Datseris · 2022-01-18T09:14:18Z

I don't see myself getting to a point where the config reflects the ARGS one-to-one, but who knows, maybe my thinking on this will evolve.

I must definitely have lost you somewhere, because you can always do, before calling the produce_or_load:

config = merge(config, ENV)

Why, or how, is this a problem?

I don't think the idea of recording in some central database the output files each script produces is going to work for me, though. So I won't pursue that further.

Okay, that's fine, but can you tell me why it won't work for your case though? Only then I can improve it.

goerz · 2022-01-19T04:58:07Z

Only then I can improve it

Well, as far as I'm concerned, there's nothing really to improve within DrWatson at this point.

I thought it would help me to have a "dry-mode" in produce_or_load, but I understand your concerns with the way I implemented it, and moreover, I've come to realize that it only gets me halfway in any case. As you pointed out, I can get the dry-run basically in a single line (calling savename). So it's fine, really.

I must definitely have lost you somewhere

Yes, I agree (or vice versa). We're definitely not on the same page about something, and I'm having a hard time figuring out what it is. As I said, I'm pretty happy at this point, so I'm not sure how much it matters, but let me try to explain.

you can always do, before calling the produce_or_load:
config = merge(config, ENV)

Ok, now I'm super confused. The config is just the dict for savename that gets used to determine the output filename, right? And ENV is the dict of environment variables, which includes dozens of entries that have nothing to do with my simulation and that I definitely wouldn't want to include in my file names. Did you mistype here, and mean something different than ENV?

Why, or how, is this a problem?

I don't think the idea of recording in some central database the output files each script produces is going to work for me, though. So I won't pursue that further.

Okay, that's fine, but can you tell me why it won't work for your case though?

Ok, so let's say I have three scripts in my project

./scripts/baseline.jl
./scripts/constrained_optimization_stage1.jl
./scripts/constrained_optimization_stage2.jl   # reads the stage1 result and continues it

I run these scripts five times as follows from the command line, in each case producing the output file indicated by the comment

julia ./scripts/baseline.jl  # ->  ./data/out/result_iter_stop=100_ν_max=0.0GHz.jld2
julia ./scripts/constrained_optimization_stage1.jl 0.7GHz  # -> ./data/out/result_iter_stop=100_ν_max=0.7GHz.jld2
julia ./scripts/constrained_optimization_stage1.jl 0.5GHz  # -> ./data/out/result_iter_stop=100_ν_max=0.5GHz.jld2
julia ./scripts/constrained_optimization_stage2.jl 0.7GHz  # -> ./data/out/result_iter_stop=1000_ν_max=0.7GHz.jld2
julia ./scripts/constrained_optimization_stage2.jl 0.5GHz  # -> ./data/out/result_iter_stop=1000_ν_max=0.5GHz.jld2

To keep track of these calls, I record the script name and the command line args (e.g., 0.7GHz) in my make.jl. Now, ideally, I want the make.jl to figure out the name of the output file automatically.

It seemed to me what you were proposing is that@produce_or_load should record in some central database the name of each jld2 output file together with the filename of the script and the config dict from which the output filename derives. Clearly, args=["0.7GHZ"] doesn't automatically map to config=Dict("iter_stop" => 100, "ν_max" => "0.7GHZ"). That mapping exists only in the source code of ./scripts/constrained_optimization_stage1.jl. If you were to include the args in the call to @produce_or_load, then they could be included in the database, of course. That would solve the problem, but it would be another parameter for @produce_or_load.

There are still other problems with that approach, like the fact that I don't necessarily want to record every call to @produce_or_load in the database – you mentioned record_to_makefile = false, which would be another parameter for @produce_or_load? Plus (at least while the entire make.jl machinery isn't actually part of DrWatson), I would be the only person that would even use that database at all. So, it feels to me like the implementation would have lots of pitfalls, and look very strange to every other user of DrWatson.

I think I must have misunderstood what you were proposing.

In any case, just asking each script what output files it will produce for given args (by running it in some special mode) seems like the most flexible approach, and one I'm perfectly happy with.

Sorry for the long wall of text. I won't blame you for not going deeper on this, especially since there's no actual problem as far as I'm concerned 😀

goerz · 2022-01-19T05:47:03Z

Only then I can improve it

Well, as far as I'm concerned, there's nothing really to improve within DrWatson at this point.

Actually, there is one thing that would make the code in my script files a little bit cleaner: It would be nice to decouple @produce_or_load from savename. That is, have a method

produce_or_load(f::Function, outfile::AbstractString)

(and similarly for the macro version). The above would simply call f() with no argument and store the result in outfile, or load outfile if it exists.

I'd rather want to write

outfile = joinpath(outdir, savename(prefix, config, "jld2"))
@produce_or_load(outfile) do
    result = do_something(config)  # access `config` as a closure from outer scope
    @strdict result
end

than the current

@produce_or_load(outdir, config, prefix=prefix) do config
    result = do_something(config)  # access `config` as function argument
    @strdict result
end

This is because I have to do outfile = joinpath(outdir, savename(prefix, config, "jld2")) anyway (in my targets function), and it would be both cleaner and safer to only calculate the savename once.

Plus, while I think DrWatson's savename to interpolate key-value pairs into filenames is generally neat, I'm not sure it will fit for every project (cf #316). Sometimes I might just want outfile=datadir("run001.jld") without having to fight @produce_or_load/savename to give me that filename. In contrast, @produce_or_load as such, if it was decoupled from savename, is a universally great feature, IMO.

Anyway, this is a bit cosmetic, but if you think it would make a useful addition, in can probably prepare a PR for that particular new method of produce_or_load.

goerz · 2022-01-19T19:04:58Z

Let me think about this a bit more, though... I might open a separate issue

goerz · 2022-01-20T12:11:19Z

So I don't think this will develop into something that will be closely tied to DrWatson.

If I do ever put this functionality into a proper package, it might have some opinionated expectations on how scripts should be written, which you're probably not on board with.

So let's close this

Datseris added the discussion Let's talk about how things should be! label Jan 11, 2022

goerz mentioned this issue Jan 11, 2022

Touch outputfile in produce_or_load #317

Closed

goerz mentioned this issue Jan 16, 2022

Refactor names of produce_or_load return values #318

Merged

goerz mentioned this issue Jan 17, 2022

Add dry_run option to produce_or_load #319

Closed

goerz mentioned this issue Jan 20, 2022

Decouple produce_or_load and savename #322

Closed

goerz closed this as completed Jan 20, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tooling for a make.jl file in the project root #315

Tooling for a make.jl file in the project root #315

goerz commented Jan 10, 2022

Datseris commented Jan 11, 2022

goerz commented Jan 11, 2022

Datseris commented Jan 16, 2022 •

edited

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

Datseris commented Jan 16, 2022 •

edited

Datseris commented Jan 16, 2022

Datseris commented Jan 16, 2022

goerz commented Jan 16, 2022

Datseris commented Jan 17, 2022

goerz commented Jan 18, 2022

Datseris commented Jan 18, 2022

goerz commented Jan 19, 2022

goerz commented Jan 19, 2022

goerz commented Jan 19, 2022

goerz commented Jan 20, 2022

Tooling for a make.jl file in the project root #315

Tooling for a make.jl file in the project root #315

Comments

goerz commented Jan 10, 2022

Datseris commented Jan 11, 2022

goerz commented Jan 11, 2022

Datseris commented Jan 16, 2022 • edited

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

goerz commented Jan 16, 2022

Datseris commented Jan 16, 2022 • edited

Datseris commented Jan 16, 2022

Datseris commented Jan 16, 2022

goerz commented Jan 16, 2022

Datseris commented Jan 17, 2022

goerz commented Jan 18, 2022

Datseris commented Jan 18, 2022

goerz commented Jan 19, 2022

goerz commented Jan 19, 2022

goerz commented Jan 19, 2022

goerz commented Jan 20, 2022

Datseris commented Jan 16, 2022 •

edited

Datseris commented Jan 16, 2022 •

edited