Ambiguity on PythonScriptStep's allow_reuse parameter #298

Closed

dataders opened this issue Apr 5, 2019 · 9 comments

@dataders

dataders commented Apr 5, 2019

What exactly does "settings/inputs" mean in this scenario?
For example, if allow_reuse=True, will a new run be generated if I change:

  • the code of the underlying Python script provided to the script_name parameter,
  • or a script_arg?

> allow_reuse (bool): Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.

I might be wrong, but I think in both cases, a new run will not be generated. This is rather frustrating when developing a pipeline with many steps...


@sanpil

sanpil commented Apr 9, 2019

The default behavior of a step execution in Pipelines is that when the script specified in the step via script_name, the inputs, and the parameters of the step all remain the same, the output of a previous step run is reused instead of running the step again. When a step is reused, the job is not submitted to the compute target; instead, the results from the previous run are made immediately available to subsequent step runs.
Azure Machine Learning Pipelines provide ways to control and alter this behavior.

### allow_reuse Flag
You can specify allow_reuse=False as a parameter of the Step. When allow_reuse is set to False, the step run won’t be reused, and a new run will always be generated for the step during pipeline execution. Default behavior of Pipelines is to set allow_reuse=True for steps.

step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False)

### regenerate_outputs Flag
If regenerate_outputs is set to True for the Experiment.submit() call, a new submission will always force generation of all step outputs and disallow data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run. Default behavior of Pipelines is to set regenerate_outputs=False for experiment submit calls.

exp = Experiment(ws, 'Hello_World')
pipeline_run = exp.submit(pipeline, regenerate_outputs=False)
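To force regeneration for a single submission (a minimal sketch reusing the exp and pipeline objects from the snippet above), pass regenerate_outputs=True:

# Force all step outputs to be regenerated for this submission only.
pipeline_run = exp.submit(pipeline, regenerate_outputs=True)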

### hash_paths Parameter
By default, only the main script file is hashed. In addition to this default hashing behavior, you can specify files or directories to be included in the hash calculation via the hash_paths parameter. Paths specified here can be absolute or relative to source_directory. To include all contents of source_directory, specify hash_paths='.'
With the additional parameter as shown below, each pipeline run tracks changes to both the script and the notebook hello_world.ipynb.

step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False,
                        hash_paths=['hello_world.ipynb'])

@dataders
Author

dataders commented Apr 9, 2019

@sanpil ok this is great information. thank you.

If I do the following, will the step be reused the second time i run the pipeline?

  1. define the PythonScriptStep below
  2. submit said step as part of a PipelineRun for the first time and it completes without error
  3. change the code inside of hello_world.py
  4. run the pipeline again
hello_step = PythonScriptStep(name="Hello World",
                              script_name="hello_world.py",
                              compute_target=aml_compute,
                              source_directory=source_directory,
                              allow_reuse=True)

If it does not re-run, I would make the case that this behavior should be reflected in the allow_reuse parameter's description.

@sanpil

sanpil commented Apr 9, 2019

In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.

@gargiulman

Hello,
I would like to know what happens when the input data to the step changes, e.g. it grows from X rows to Y rows, while allow_reuse=True. I have seen the step not re-run and instead use the previous step run's result. Is this an expected scenario? I would expect the step to re-run because the input data has changed, even though the input data file name is the same.

@sanpil

sanpil commented Aug 9, 2019

If the data is in a datastore, we would not be able to detect the data change. If the data is uploaded as part of the snapshot (under source_directory) [this is not recommended though], then the hash will change and will trigger a rerun.
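
One hedged workaround in that case (the step name "Load Data" and script load_data.py below are made up for illustration; the other parameters follow the earlier snippets) is to set allow_reuse=False on the step that reads from the datastore, so it always re-runs:

# Hypothetical step that reads from a datastore; changes to that data are not
# detected by hashing, so disable reuse to make the step run every time.
load_data_step = PythonScriptStep(name="Load Data",
                                  script_name="load_data.py",
                                  compute_target=aml_compute,
                                  source_directory=source_directory,
                                  allow_reuse=False)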

Copy link

On this same note, it's confusing that the wording/explanation changes between docs. In the main how-to guides and even in the comments it says that if the script changes the pipeline will not reuse the previous results.

Seems like the actual behavior is that if the snapshot changes, the step will not be reused, as stated in the remarks section: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py#remarks

This behavior makes sense but the documentation is inconsistent which makes it confusing.

@yychenca

I'm curious: if I modify the underlying script of a PythonScriptStep and then run the pipeline again, will the pipeline pick up the already-finished steps' outputs (reuse) and only rerun the steps whose underlying script was modified, along with the steps downstream of them?

For instance,

  1. Define a pipeline as below
  2. Run the pipeline and it finished successfully
  3. Modify the underlying script for train_step
  4. Run the pipeline again
my_pipeline =  Pipeline(workspace=ws, steps=[split_data_step, train_step])

Would AML reuse the outputs from the successfully finished split_data_step and only rerun train_step?

> In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.

@dataders
Author

> Would AML reuse the outputs from the successfully finished split_data_step and only rerun train_step?

@yychenca, yes it will! To me, this is the killer feature of Azure ML pipelines.

The important thing to note here is that you must have a unique source_directory for each PythonScriptStep. When the pipeline is submitted, each source directory is hashed and compared to the previous run to see if there have been changes. If the scripts for multiple steps are in the same folder, changing one script will force all of those steps to re-run.
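
As a hedged illustration (the folder names "split_data" and "train" and the script names below are hypothetical), giving each step its own source_directory keeps the hashes independent:

# Each step lives in its own folder, so editing train.py only invalidates train_step.
split_data_step = PythonScriptStep(name="Split Data",
                                   script_name="split_data.py",
                                   compute_target=aml_compute,
                                   source_directory="split_data",
                                   allow_reuse=True)

train_step = PythonScriptStep(name="Train",
                              script_name="train.py",
                              compute_target=aml_compute,
                              source_directory="train",
                              allow_reuse=True)

my_pipeline = Pipeline(workspace=ws, steps=[split_data_step, train_step])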

Feel free to reply back if you aren't seeing the intended behavior.

@yychenca

This is awesome! I just found the same suggestion from #734 (comment)

Will give it a try and report back my findings. Thanks!
