One of the main uses of Renku is that it lets you track commands that you execute on the command line, rerun them, compose them into bigger pipelines and inspect how files were generated in your project.
For any command you would usually run on the command line, you can just pass
that command through Renku (by prepending renku run
) when running it to track the execution.
For instance, if we had a script.py
that reads a file, appends text to the
content and writes it out to a different file, like
import sys
input_path = sys.argv[1]
output_path = sys.argv[2]
append_text = sys.argv[3]
with open(input_path, "r") as input_file, open(output_path, "w") as output_file:
text = input_file.read() + append_text
output_file.write(text)
that you normally call like
$ python script.py data.csv output.txt "my text"
You can just call it with Renku like
$ renku run -- python script.py data.csv output.txt "my text"
This would
- Track this execution of the command, detecting any files it used as input and
generated as output, as well as the text parameter
"my text"
- Add the recorded execution in the overall directed acyclic graph (DAG) that links together workflow executions
- Create a Plan entity, which serves as a recipe for the command you just executed, allowing you to execute it with different input, output or parameter values
- Allow you to detect out of date outputs should the input change in the future
You can see that all your workflow outputs are up to date using
$ renku status
Everything is up-to-date.
Right now, everything is fine since we didn't make any changes. But if we modify
data.csv
, we would get
$ renku status Outdated outputs(1):
(use `renku workflow visualize [<file>...]` to see the full lineage) (use
`renku update --all` to generate the file from its latest inputs)
output.txt: data.csv
Modified inputs(1):
data.csv
This tells us that data.csv
was changed and as a result output.txt
is
out of date and should be updated.
We can do so using
$ renku update output.txt
Resolved '../../../../../tmp/tmp9wtjmp5_' to
'file:///tmp/tmp9wtjmp5_' [job 1f2c73c4-01d9-40cc-b351-b13e48c51577]
/tmp/xkjzau4m$ python \
/tmp/xkjzau4m/script.py \ /tmp/xkjzau4m/data.csv \ output.txt "my text"
[job 1f2c73c4-01d9-40cc-b351-b13e48c51577] completed success Moving outputs
[ ] 1/1
This runs the command we recorded earlier, with the new input data, to create
output.txt
again. Renku is smart enough to only run those parts of the DAG
that changed and need to be updated.
Sometimes there are cases where the automated detection of inputs/outputs/parameters doesn't work or is not sufficient.
Lets say our script.py
looked instead like:
with open("data.csv", "r") as input_file, open("output.txt", "w") as output_file:
text = input_file.read() + "my text"
output_file.write()
Renku doesn't know that you script reads data.csv
as an input, because it
does not show up on the command line. Though it would still detect
output.txt
as an output since it monitors files on disk for changes.
You could let renku know manually that this is the case by running
$ renku run --input data.csv --output output.txt -- python script.py
This would let Renku know that this script has one input data.csv
along
with one output output.txt
.
Renku will automatically generate names for inputs, outputs and parameters on
the created Plan, so they can be used in other Renku commands such as renku
workflow execute
. You can also specify the names directly to have more human
readable names, by prepending the name like:
$ renku run --input data_file=data.csv --output result=output.txt -- python script.py
This would set the name for the input file to data_file
and the name for the
output file to result
.
Similarly, if you had a command python script.py example
and there is a
file named example
on disk, renku would detect it as an input. But if this
was just a coincidence and example
was actually a string input unrelated to
the file, you could run renku run --parameter my_param="example" -- python
script.py example
to let renku know that example
is a parameter, not an
input file.
Alternatively, you can also specify this information in a YAML file, which is nicer in cases where there are many inputs or you want to specify inputs programmatically.
In this case, the file would look like
data_file: data.csv
and should be stored as .renku/tmp/inputs.yml
. along with
result: output.txt
stored as .renku/tmp/outputs.yml
.
Then running the command normally will pick this up and add it to the workflow metadata, so it just becomes:
$ renku run -- python script.py
Note that while this allows renku to track data.csv
as an input, it does not
allow you to specify a different path for the input later on, as the path is
hard-coded in your code.
The same can be done with .renku/tmp/parameters.yml
for parameters.
A third option if you are working with Python is to make use of the Renku Python API. This lets you specify inputs/outputs/parameters directly in code. Our script would the look something like this:
from renku.api import Input, Output, Parameter
with open(Input("data_file", "data.csv"), "r") as input_file, open(Output("result", "output.txt"), "w") as output_file:
text = input_file.read() + Parameter("append_text", "my text").value
output_file.write()
and run it like
$ renku run -- python script.py
This achieves the same as in the examples above, specifying that data.csv
is
an input, output.txt
is an output and example
is a parameter. It names
the references on the created Plan data_file
, result
and append_text
respectively. The big benefit of this approach is that it does allow changing
the values used when executing the created workflow again, e.g. using renku
workflow execute
. Then for instance the Input(...)
part could return the
modified value instead of the hard-coded data.csv
.
If you do not want renku to try and automatically detect inputs or outputs, you
can use the --no-input-detection
or --no-output-detection
flags to
renku run
, respectively. You can also let Renku know that a workflow does
not produce an output file with the --no-output
flag.