
[elan] Deploy Elan VM before April downtime #41

Closed
SamStudio8 opened this issue Mar 29, 2021 · 8 comments

Comments

@SamStudio8

No description provided.

@SamStudio8

RP has provisioned a machine to use during the downtime period. I'll work on packaging Elan up a bit better to deploy it there before next Thursday.

@SamStudio8

Will also need to sort out mqtt (#22) to ensure Asklepian runs.


SamStudio8 commented Mar 31, 2021

  • Elan removed from head node crontab
  • Elan added to Elan node crontab
  • go-full-elan script updated to parameterise the Nextflow configuration SamStudio8/elan-nextflow@3163e95
  • Nextflow temporarily pointed at a new configuration using 120 of the node's 128 cores with the local executor, i.e. no SLURM (see the sketch at the end of this comment)
  • Tested Ocarina -- we can reach Majora with OAuth credentials
  • mqtt-message script updated to allow the host to be overridden SamStudio8/elan-nextflow@2e922fc
  • go-full-elan script updated to parameterise the MQTT host SamStudio8/elan-nextflow@4d59dd8
  • Tested MQTT message sending

Committed to production -- tomorrow's Elan run should use the new node
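
For posterity, the no-SLURM setup is nothing exotic; here's a minimal sketch of what the parameterised launch looks like. The file name elan-local.config, the ELAN_CONFIG variable, and the elan.nf entry point are assumptions for illustration, not what go-full-elan actually contains; the config directives themselves are standard Nextflow.

```sh
# Sketch only: pick a node-specific Nextflow config at launch time
# rather than hard-coding it in the pipeline. The assumed file
# elan-local.config would hold plain Nextflow configuration, e.g.
#
#   process.executor = 'local'   # no SLURM on the Elan node
#   executor.cpus    = 120       # cap at 120 of the node's 128 cores
#
ELAN_CONFIG="${ELAN_CONFIG:-elan-local.config}"   # overridable, assumed name
nextflow run elan.nf -c "$ELAN_CONFIG" -resume
```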

@SamStudio8

Absolutely. Blazing.

[b5/f5d0c3] process > save_manifest            [100%] 1 of 1 ✔
[8c/e75c97] process > resolve_uploads          [100%] 1 of 1 ✔
[a5/5d8a66] process > samtools_quickcheck      [100%] 7122 of 7122 ✔
[0a/057692] process > fasta_quickcheck         [100%] 7122 of 7122 ✔
[8b/7e1ba3] process > save_uploads             [100%] 7122 of 7122, failed: 473 ✔
[45/19d2ca] process > rehead_bam               [100%] 6649 of 6649 ✔
[3f/ad37b5] process > samtools_filter_and_sort [100%] 6649 of 6649 ✔
[db/65c5e6] process > samtools_index           [100%] 6649 of 6649 ✔
[a6/17a025] process > samtools_depth           [100%] 6649 of 6649 ✔
[0e/570933] process > rehead_fasta             [100%] 6649 of 6649 ✔
[a0/56acd0] process > swell                    [100%] 6649 of 6649 ✔
[31/85558d] process > post_swell               [100%] 6649 of 6649 ✔
[b9/ea6754] process > ocarina_ls               [100%] 6649 of 6649 ✔
Completed at: 01-Apr-2021 13:12:47
Duration    : 4h 17m 37s
CPU hours   : 477.0 (0% failed)
Succeeded   : 74'087
Ignored     : 473
Failed      : 473

🔥

@SamStudio8

RP says the connection to Majora will be a little slower from this node, but we're able to blow twice as many Ocarinas at the post-Elan step. Publishing and MQTT still to come.
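
To spell that out: each request to Majora is a little slower, but throughput holds because twice as many requests run at once. A purely illustrative sketch of that trade follows; the real fan-out is orchestrated by Nextflow, and ocarina_put.sh, samples.ls, and the counts are made up for the example.

```sh
# Illustrative only -- not the actual pipeline plumbing.
# If each Ocarina call is slower from this node, doubling the number
# of concurrent calls keeps overall throughput up.
xargs -P 16 -n 1 ./ocarina_put.sh < samples.ls   # 16 vs 8 are example values
```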

@SamStudio8

Forgot that part of the post-Elan publish step is submitted to SLURM, but I've parameterised the publish mode and committed that change to Elan (SamStudio8/elan-nextflow@c4f9dcc). After a little conda faff (is it bioinformatics without it?), everything finished up in record time and emitted a message to MQTT without trouble. Will keep an eye out tomorrow morning, but I'm happy.
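
For reference, the host override from 2e922fc boils down to something like the sketch below, shown here with the stock mosquitto client. MQTT_HOST, the topic, and the payload are placeholders for illustration, not the real Asklepian trigger.

```sh
# Sketch of an overridable MQTT host in the spirit of the mqtt-message
# change above; all names here are placeholders.
MQTT_HOST="${MQTT_HOST:-localhost}"
mosquitto_pub -h "$MQTT_HOST" -t 'elan/status' -m '{"status": "finished"}'
```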

@SamStudio8

Encountered an issue early this morning caused by Nextflow exceeding the thread pool limit when resuming tasks with the local executor. This has been reported before (nextflow-io/nextflow#1871), and the proposed fix of adding -Dnxf.pool.type=sync to NXF_OPTS has been deployed and seems to be working.
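
Concretely, the deployed mitigation from that thread is a launch-side environment variable; no pipeline changes needed (elan.nf stands in for the real entry point):

```sh
# Workaround from nextflow-io/nextflow#1871: use the synchronous thread
# pool so a large -resume with the local executor doesn't exhaust the
# default pool.
export NXF_OPTS="-Dnxf.pool.type=sync"
nextflow run elan.nf -resume
```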


SamStudio8 commented Apr 5, 2021

Additionally, the swell step appears to have segfaulted for a very small number of samples (n=3), which seems to be related to a resource limit causing numpy's initialisation to fail, killing swell with it. RP has increased the process and open-file limits and we'll keep an eye on tomorrow's run.

Error below for posterity:

OpenBLAS blas_thread_init: pthread_create failed for thread 63 of 64: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 65535 max

Update: This error has gone away now. We might want to lower OPENBLAS_NUM_THREADS from the default of 64 anyway, but it isn't urgent.
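
If we do get around to it, that's a one-line environment change; the value below is an arbitrary example, and ulimit -u shows the RLIMIT_NPROC ceiling OpenBLAS was complaining about.

```sh
# Cap OpenBLAS's thread pool so numpy initialisation doesn't try to
# spawn one thread per core (64 here); 8 is an arbitrary example value.
export OPENBLAS_NUM_THREADS=8

# Check the per-user process limit (RLIMIT_NPROC) that was being hit:
ulimit -u
```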
