Ambiguity on PythonScriptStep's allow_reuse parameter #298

Closed

dataders opened this issue Apr 5, 2019 · 9 comments

@dataders

dataders commented Apr 5, 2019

What exactly does "settings/inputs" mean in this scenario?
For example, if allow_reuse=True, will a new run be generated if I change:

  • the code of the underlying Python script provided to the script_name parameter,
  • or a script_arg?

> allow_reuse (bool): Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.

I might be wrong, but I think in both cases, a new run will not be generated. This is rather frustrating when developing a pipeline with many steps...


@sanpil

sanpil commented Apr 9, 2019

The default behavior of a step execution in Pipelines is that when the script specified in the step via script_name, the inputs, and the parameters of the step all remain the same, the output of a previous step run is reused instead of running the step again. When a step is reused, the job is not submitted to the compute target; instead, the results from the previous run are made immediately available to subsequent step runs.
Azure Machine Learning Pipelines provide ways to control and alter this behavior.

### allow_reuse Flag
You can specify allow_reuse=False as a parameter of the Step. When allow_reuse is set to False, the step run won’t be reused, and a new run will always be generated for the step during pipeline execution. Default behavior of Pipelines is to set allow_reuse=True for steps.

step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False)

### regenerate_outputs Flag
If regenerate_outputs is set to True for the Experiment.submit() call, a new submission will always force generation of all step outputs and disallow data reuse for any step of this run. Once this run is complete, however, subsequent runs may reuse the results of this run. Default behavior of Pipelines is to set regenerate_outputs=False for experiment submit calls.

exp = Experiment(ws, 'Hello_World')
pipeline_run = exp.submit(pipeline, regenerate_outputs=False)
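To force regeneration for a single submission (a minimal sketch reusing the exp and pipeline objects from the snippet above), pass regenerate_outputs=True:

# Force all step outputs to be regenerated for this submission only.
pipeline_run = exp.submit(pipeline, regenerate_outputs=True)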

### hash_paths Parameter
By default, only the main script file is hashed. In addition to this default hashing behavior, you can specify files or directories to be included in the hash calculation via the hash_paths parameter. Paths specified here can be absolute or relative to source_directory. To include all contents of source_directory, specify hash_paths='.'
With the additional parameter as shown below, each pipeline run tracks changes to both the script and the notebook hello_world.ipynb.

step = PythonScriptStep(name="Hello World",
                        script_name="hello_world.py",
                        compute_target=aml_compute,
                        source_directory=source_directory,
                        allow_reuse=False,
                        hash_paths=['hello_world.ipynb'])

@dataders
Author

dataders commented Apr 9, 2019

@sanpil ok this is great information. thank you.

If I do the following, will the step be reused the second time i run the pipeline?

  1. define the PythonScriptStep below
  2. submit said step as part of a PipelineRun for the first time and it completes without error
  3. change the code inside of hello_world.py
  4. run the pipeline again
hello_step = PythonScriptStep(name="Hello World",
                              script_name="hello_world.py",
                              compute_target=aml_compute,
                              source_directory=source_directory,
                              allow_reuse=True)

If it does not re-run, I would make the case that this behavior should be reflected in the allow_reuse parameter's description.

@sanpil

sanpil commented Apr 9, 2019

In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.

@gargiulman

Hello,
I would like to know what happens when the input data to the step changes, e.g. it grows from X rows to Y rows, while allow_reuse=True. I have seen the step not re-run and instead use the previous step run's result. Is this an expected scenario? I would expect the step to re-run because the input data has changed, even though the input data file name is the same.

@sanpil

sanpil commented Aug 9, 2019

If the data is in a datastore, we would not be able to detect the data change. If the data is uploaded as part of the snapshot (under source_directory) [this is not recommended though], then the hash will change and will trigger a rerun.
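
One hedged workaround in that case (the step name "Load Data" and script load_data.py below are made up for illustration; the other parameters follow the earlier snippets) is to set allow_reuse=False on the step that reads from the datastore, so it always re-runs:

# Hypothetical step that reads from a datastore; changes to that data are not
# detected by hashing, so disable reuse to make the step run every time.
load_data_step = PythonScriptStep(name="Load Data",
                                  script_name="load_data.py",
                                  compute_target=aml_compute,
                                  source_directory=source_directory,
                                  allow_reuse=False)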

Copy link

On this same note, it's confusing that the wording/explanation changes between docs. In the main how-to guides and even in the comments it says that if the script changes the pipeline will not reuse the previous results.

Seems like the actual behavior is that if the snapshot changes, the step will not be reused, as stated in the remarks section: https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py#remarks

This behavior makes sense but the documentation is inconsistent which makes it confusing.

@yychenca

I'm curious: if I modify the underlying script of a PythonScriptStep and then run the pipeline again, will the pipeline pick up the already-finished steps' outputs (reuse) and only rerun the steps whose underlying script was modified, along with the steps downstream of them?

For instance,

  1. Define a pipeline as below
  2. Run the pipeline and it finished successfully
  3. Modify the underlying script for train_step
  4. Run the pipeline again
my_pipeline =  Pipeline(workspace=ws, steps=[split_data_step, train_step])

Would AML reuse the outputs from the successfully finished split_data_step and only rerun train_step?

> In the above scenario, if you make a change to hello_world.py then rerun the pipeline, the step WILL NOT be reused (it will re-run). If you see otherwise, please provide the Pipeline Run ID and Step Run ID and we will take a look.

@dataders
Author

> Would AML reuse the outputs from the successfully finished split_data_step and only rerun train_step?

@yychenca, yes it will! To me, this is the killer feature of Azure ML pipelines.

The important thing to note here is that you must have a unique source_directory for each PythonScriptStep. When the pipeline is submitted, each source directory is hashed and compared to the previous run to see if there have been changes. If the scripts for multiple steps are in the same folder, changing one script will force all of those steps to re-run.
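
As a hedged illustration (the folder names "split_data" and "train" and the script names below are hypothetical), giving each step its own source_directory keeps the hashes independent:

# Each step lives in its own folder, so editing train.py only invalidates train_step.
split_data_step = PythonScriptStep(name="Split Data",
                                   script_name="split_data.py",
                                   compute_target=aml_compute,
                                   source_directory="split_data",
                                   allow_reuse=True)

train_step = PythonScriptStep(name="Train",
                              script_name="train.py",
                              compute_target=aml_compute,
                              source_directory="train",
                              allow_reuse=True)

my_pipeline = Pipeline(workspace=ws, steps=[split_data_step, train_step])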

Feel free to reply back if you aren't seeing the intended behavior.

@yychenca

This is awesome! I just found the same suggestion from #734 (comment)

Will give it a try and report back my findings. Thanks!
