[WORKFLOWS-220] Create py-orca demonstration script #23

Merged: BrunoGrandePhD merged 6 commits into main from bgrande/WORKFLOWS-220/demo-script on May 31, 2023

Conversation

BrunoGrandePhD
Contributor

@BrunoGrandePhD BrunoGrandePhD commented May 18, 2023

This script demonstrates how to use py-orca to process a dataset (in this case, RNA-seq) through a series of workflow runs: nf-synstage, nf-core/rnaseq, and nf-synindex. I made some minor modifications to the py-orca module to make the script easier to write. The script's current location is temporary; we can discuss whether to move it to a subfolder or a separate repository.

One key improvement over the scripts that I wrote for NTAP is that this one defines an actual DAG. Building it in Airflow would make it hard for users to experiment with, so I decided to try Metaflow instead, which makes it easy to test a DAG locally (and offers the option of deploying it to AWS later). It's similar to Prefect and Dagster. We can revisit this choice once we settle on a strategy for consistency between Airflow and non-Airflow DAGs.

python3 demo.py run --dataset_id syn51514585
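
For readers who want to see the shape of the DAG without installing anything, here is a minimal plain-Python sketch. It is not Metaflow code and not taken from demo.py: the step names are the ones Metaflow reports in the run output in this description, while the assumption that each step depends only on the step before it (a linear chain) is mine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Step names as they appear in the Metaflow run output.
STEPS = [
    "start",
    "load_dataset",
    "transfer_samplesheet_to_s3",
    "launch_synstage",
    "monitor_synstage",
    "launch_rnaseq",
    "monitor_rnaseq",
    "launch_synindex",
    "monitor_synindex",
    "end",
]

# Assumption: every step depends only on its predecessor (a linear chain).
# The mapping is {step: set of steps that must finish first}.
DEPENDENCIES = {step: {previous} for previous, step in zip(STEPS, STEPS[1:])}
DEPENDENCIES[STEPS[0]] = set()


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(DEPENDENCIES), start=1):
        print(position, step)
```

Expressing the flow as explicit dependencies is what separates a DAG from a flat script: a framework can validate the graph before running anything, which is what the "The graph looks good!" message in the output reflects.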

Some of the components in this script (e.g. RnaseqDataset and TowerRnaseqFlow) could be reused in different contexts. We should also discuss whether this is desirable and, if so, where to store these reusable components (e.g. in py-orca or elsewhere).
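
To give a sense of what a reusable dataset component could look like, here is a minimal sketch. The three field names come from the `RnaseqDataset(...)` repr printed at the end of the run output; the use of a frozen dataclass and the `run_name` helper are assumptions for illustration, not the actual definition in demo.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RnaseqDataset:
    """Sketch of a dataset description (fields inferred from the run output)."""

    id: str             # dataset identifier, e.g. "my_test_dataset"
    samplesheet: str    # S3 URI of the samplesheet CSV
    output_folder: str  # Synapse ID of the folder that receives the results

    def run_name(self, suffix: str) -> str:
        # Hypothetical helper: derive a workflow run name from the dataset ID.
        return f"{self.id}_{suffix}"


if __name__ == "__main__":
    dataset = RnaseqDataset(
        id="my_test_dataset",
        samplesheet="s3://orca-service-test-project-tower-scratch/30days/my_test_dataset.csv",
        output_folder="syn51514559",
    )
    print(dataset)
    print(dataset.run_name("synindex"))
```

Keeping the dataset description immutable and free of service calls is one way to make it portable between Metaflow, Airflow, or a plain script.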

  • Document demo.py in README
  • Parameterize the S3 output prefixes
Command and Output

pipenv run python3 demo.py run --dataset_id syn51514585
Loading .env environment variables...
[2023-05-18T15:49:37.888-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
Metaflow 2.9.1 executing TowerRnaseqFlow for user:bgrande
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2023-05-18 15:49:38.591 Workflow starting (run-id 1684450178581424):
2023-05-18 15:49:38.599 [1684450178581424/start/1 (pid 40447)] Task is starting.
2023-05-18 15:49:40.521 [1684450178581424/start/1 (pid 40447)] [2023-05-18T15:49:40.515-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:49:40.952 [1684450178581424/start/1 (pid 40447)] Task finished successfully.
2023-05-18 15:49:40.997 [1684450178581424/load_dataset/2 (pid 40701)] Task is starting.
2023-05-18 15:49:43.384 [1684450178581424/load_dataset/2 (pid 40701)] [2023-05-18T15:49:43.379-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:49:47.468 [1684450178581424/load_dataset/2 (pid 40701)] Task finished successfully.
2023-05-18 15:49:47.482 [1684450178581424/transfer_samplesheet_to_s3/3 (pid 40914)] Task is starting.
2023-05-18 15:49:49.356 [1684450178581424/transfer_samplesheet_to_s3/3 (pid 40914)] [2023-05-18T15:49:49.352-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:49:54.487 [1684450178581424/transfer_samplesheet_to_s3/3 (pid 40914)] Task finished successfully.
2023-05-18 15:49:54.498 [1684450178581424/launch_synstage/4 (pid 40962)] Task is starting.
2023-05-18 15:49:56.278 [1684450178581424/launch_synstage/4 (pid 40962)] [2023-05-18T15:49:56.274-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:00.079 [1684450178581424/launch_synstage/4 (pid 40962)] 2023-05-18 15:50:00,079 - INFO - Found a previous run: my_test_dataset_synstage_2 (id='1OYvkISqWZ3XPt', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:00.080 [1684450178581424/launch_synstage/4 (pid 40962)] [2023-05-18T15:50:00.079-0700] {ops.py:145} INFO - Found a previous run: my_test_dataset_synstage_2 (id='1OYvkISqWZ3XPt', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:00.530 [1684450178581424/launch_synstage/4 (pid 40962)] Task finished successfully.
2023-05-18 15:50:00.543 [1684450178581424/monitor_synstage/5 (pid 40978)] Task is starting.
2023-05-18 15:50:02.377 [1684450178581424/monitor_synstage/5 (pid 40978)] [2023-05-18T15:50:02.374-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:03.997 [1684450178581424/monitor_synstage/5 (pid 40978)] 2023-05-18 15:50:03,996 - INFO - Workflow(run_name=my_test_dataset_synstage_2, id=1OYvkISqWZ3XPt, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:03.997 [1684450178581424/monitor_synstage/5 (pid 40978)] [2023-05-18T15:50:03.996-0700] {ops.py:273} INFO - Workflow(run_name=my_test_dataset_synstage_2, id=1OYvkISqWZ3XPt, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:04.413 [1684450178581424/monitor_synstage/5 (pid 40978)] Task finished successfully.
2023-05-18 15:50:04.425 [1684450178581424/launch_rnaseq/6 (pid 40984)] Task is starting.
2023-05-18 15:50:06.380 [1684450178581424/launch_rnaseq/6 (pid 40984)] [2023-05-18T15:50:06.375-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:09.552 [1684450178581424/launch_rnaseq/6 (pid 40984)] 2023-05-18 15:50:09,552 - INFO - Found a previous run: my_test_dataset_rnaseq_2 (id='2g9BNKlIOhe7r3', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:09.552 [1684450178581424/launch_rnaseq/6 (pid 40984)] [2023-05-18T15:50:09.552-0700] {ops.py:145} INFO - Found a previous run: my_test_dataset_rnaseq_2 (id='2g9BNKlIOhe7r3', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:09.970 [1684450178581424/launch_rnaseq/6 (pid 40984)] Task finished successfully.
2023-05-18 15:50:09.984 [1684450178581424/monitor_rnaseq/7 (pid 41018)] Task is starting.
2023-05-18 15:50:11.690 [1684450178581424/monitor_rnaseq/7 (pid 41018)] [2023-05-18T15:50:11.687-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:13.294 [1684450178581424/monitor_rnaseq/7 (pid 41018)] 2023-05-18 15:50:13,293 - INFO - Workflow(run_name=my_test_dataset_rnaseq_2, id=2g9BNKlIOhe7r3, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:13.294 [1684450178581424/monitor_rnaseq/7 (pid 41018)] [2023-05-18T15:50:13.293-0700] {ops.py:273} INFO - Workflow(run_name=my_test_dataset_rnaseq_2, id=2g9BNKlIOhe7r3, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:13.776 [1684450178581424/monitor_rnaseq/7 (pid 41018)] Task finished successfully.
2023-05-18 15:50:13.790 [1684450178581424/launch_synindex/8 (pid 41034)] Task is starting.
2023-05-18 15:50:16.107 [1684450178581424/launch_synindex/8 (pid 41034)] [2023-05-18T15:50:16.101-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:19.315 [1684450178581424/launch_synindex/8 (pid 41034)] 2023-05-18 15:50:19,315 - INFO - Found a previous run: my_test_dataset_synindex (id='2VK7DTe3habe5l', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:19.315 [1684450178581424/launch_synindex/8 (pid 41034)] [2023-05-18T15:50:19.315-0700] {ops.py:145} INFO - Found a previous run: my_test_dataset_synindex (id='2VK7DTe3habe5l', state=<WorkflowState.SUCCEEDED: 'SUCCEEDED'>)
2023-05-18 15:50:19.703 [1684450178581424/launch_synindex/8 (pid 41034)] Task finished successfully.
2023-05-18 15:50:19.715 [1684450178581424/monitor_synindex/9 (pid 41051)] Task is starting.
2023-05-18 15:50:21.526 [1684450178581424/monitor_synindex/9 (pid 41051)] [2023-05-18T15:50:21.523-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:23.151 [1684450178581424/monitor_synindex/9 (pid 41051)] 2023-05-18 15:50:23,151 - INFO - Workflow(run_name=my_test_dataset_synindex, id=2VK7DTe3habe5l, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:23.151 [1684450178581424/monitor_synindex/9 (pid 41051)] [2023-05-18T15:50:23.151-0700] {ops.py:273} INFO - Workflow(run_name=my_test_dataset_synindex, id=2VK7DTe3habe5l, state=WorkflowState.SUCCEEDED) is now done!
2023-05-18 15:50:23.551 [1684450178581424/monitor_synindex/9 (pid 41051)] Task finished successfully.
2023-05-18 15:50:23.565 [1684450178581424/end/10 (pid 41058)] Task is starting.
2023-05-18 15:50:25.807 [1684450178581424/end/10 (pid 41058)] [2023-05-18T15:50:25.802-0700] {crypto.py:83} WARNING - empty cryptography key - values will not be stored encrypted.
2023-05-18 15:50:25.872 [1684450178581424/end/10 (pid 41058)] Completed processing RnaseqDataset(id='my_test_dataset', samplesheet='s3://orca-service-test-project-tower-scratch/30days/my_test_dataset.csv', output_folder='syn51514559')
2023-05-18 15:50:25.873 [1684450178581424/end/10 (pid 41058)] synstage workflow ID: 1OYvkISqWZ3XPt
2023-05-18 15:50:25.873 [1684450178581424/end/10 (pid 41058)] nf-core/rnaseq workflow ID: 2g9BNKlIOhe7r3
2023-05-18 15:50:25.873 [1684450178581424/end/10 (pid 41058)] synindex workflow ID: 2VK7DTe3habe5l
2023-05-18 15:50:26.278 [1684450178581424/end/10 (pid 41058)] Task finished successfully.
2023-05-18 15:50:26.280 Done!
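
The output above shows two behaviors worth spelling out: each launch step reports "Found a previous run" and reuses it instead of submitting a new one, and each monitor step waits until the workflow reaches a final state. The sketch below illustrates that pattern against an in-memory stand-in. `FakeWorkflowService`, `launch`, and `monitor` are hypothetical names, not the py-orca API, and relaunching after a failed run is an assumption rather than confirmed behavior.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class WorkflowState(Enum):
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


TERMINAL_STATES = {WorkflowState.SUCCEEDED, WorkflowState.FAILED}


@dataclass
class Workflow:
    run_name: str
    id: str
    state: WorkflowState


class FakeWorkflowService:
    """In-memory stand-in for a workflow service (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self.runs: Dict[str, Workflow] = {}    # latest run per run name
        self._by_id: Dict[str, Workflow] = {}  # every run ever submitted

    def find(self, run_name: str) -> Optional[Workflow]:
        return self.runs.get(run_name)

    def submit(self, run_name: str) -> Workflow:
        workflow = Workflow(run_name, f"fake-{len(self._by_id) + 1}", WorkflowState.SUBMITTED)
        self.runs[run_name] = workflow
        self._by_id[workflow.id] = workflow
        return workflow

    def poll(self, workflow_id: str) -> WorkflowState:
        # Each poll advances the fake run: SUBMITTED -> RUNNING -> SUCCEEDED.
        workflow = self._by_id[workflow_id]
        if workflow.state is WorkflowState.SUBMITTED:
            workflow.state = WorkflowState.RUNNING
        elif workflow.state is WorkflowState.RUNNING:
            workflow.state = WorkflowState.SUCCEEDED
        return workflow.state


def launch(service: FakeWorkflowService, run_name: str) -> str:
    """Reuse a previous run with the same name unless it failed (assumed policy)."""
    previous = service.find(run_name)
    if previous is not None and previous.state is not WorkflowState.FAILED:
        return previous.id
    return service.submit(run_name).id


def monitor(service: FakeWorkflowService, workflow_id: str,
            wait_seconds: float = 0.0, max_polls: int = 100) -> WorkflowState:
    """Poll until the workflow reaches a terminal state or the poll budget runs out."""
    for _ in range(max_polls):
        state = service.poll(workflow_id)
        if state in TERMINAL_STATES:
            return state
        time.sleep(wait_seconds)
    raise TimeoutError(f"Workflow {workflow_id} did not finish after {max_polls} polls")
```

Making the launch step idempotent is what lets the whole flow be rerun cheaply, as in the output above, where all three workflows had already succeeded and the run finished in under a minute.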

@BrunoGrandePhD BrunoGrandePhD requested a review from a team May 18, 2023 22:39
codecov bot commented May 18, 2023

Codecov Report

Merging #23 (e109112) into main (50e33ec) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main       #23   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           28        28           
  Lines          854       869   +15     
  Branches       134       137    +3     
=========================================
+ Hits           854       869   +15     
Impacted Files                                 Coverage Δ
src/orca/services/nextflowtower/__init__.py    100.00% <100.00%> (ø)
src/orca/services/nextflowtower/models.py      100.00% <100.00%> (ø)
src/orca/services/synapse/client_factory.py    100.00% <100.00%> (ø)
src/orca/services/synapse/ops.py               100.00% <100.00%> (ø)

Collaborator

@thomasyu888 thomasyu888 left a comment

This looks awesome! See discussion points.

demo.py: 5 review threads, all resolved (2 marked outdated)
@BrunoGrandePhD
Contributor Author

@thomasyu888 @BWMac Could at least one of you try to run the script locally, following my instructions in the README? In theory, the synindex step should work because DPE has admin access to the output_folder, but let me know if you experience otherwise.

@BWMac
Contributor

BWMac commented May 30, 2023

@BrunoGrandePhD Giving it a shot now!

After working out a couple of kinks with Bruno, the whole process was executed successfully. I think the instructions are great and it's easy for someone to try it out for themselves.

@BrunoGrandePhD BrunoGrandePhD merged commit dcb324b into main May 31, 2023
10 checks passed
@BrunoGrandePhD BrunoGrandePhD deleted the bgrande/WORKFLOWS-220/demo-script branch May 31, 2023 13:44