Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minutes 22-2021-09-01 #132

Merged
merged 2 commits into from Sep 2, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
60 changes: 60 additions & 0 deletions minutes/22-2021-09-01.md
@@ -0,0 +1,60 @@
| <img src="/assets/dipi.png" alt="DIPI Badge" width="150"> | Data Integrity and Pipeline Integration (DIPI) Working Group |
| -------------- | -------------------- |
| **Meeting** | 22 Show and Text |
| **Date** | 2021-09-01 |
| **Attendance** | SN RC VH |
| **Away** | |


## Special items

* Head node migration has been a success, stability and responsiveness seems to have improved
* SN has noticed that recovering the Nextflow cache for large jobs like ENA BAM is much faster
* DIPI developers should note the new head node has fewer cores and less RAM so some resource limits or local CPU usage settings may need to be adjusted
* SN has hired a new junior developer who should be joining CLIMB at the end of the month
* Expecting to task them with review of Elan performance, improving monitoring across CLIMB-COVID and implementing more QC

## Standing items

### Current development work

* SN has been tasked to build a new system to allow uploads from third-parties to the CLIMB-COVID platform without directly providing access to those parties
* Likely some form of broker object store which will collect and inject sequences and metadata to CLIMB and Majora if data passes more strict QC rules (as commerical providers are not typically offering the same standards as COG)
* SN used some CLI-fu to diagnose a `samtools depth` bottleneck in Elan during last week's large run (n=17312, officially our largest yet)
* Checking the samtools changelog, SN discovered a more recent version (v1.13) offers a rewritten version of the depth subcommand
* Swapping this in significantly improves the performance of the depth step (20x), speeding Elan up by around 3x total!
* See https://github.com/COG-UK/dipi-group/issues/129
* SN quietly fixed the amusing `1` inbound message after the problem degraded to suppressing one of the inbound messages entirely (https://github.com/COG-UK/dipi-group/issues/54)
* RP has begun deleting old FASTA and BAM from user `uploads/` as announced previously to release storage
* VH asked about the status of the `adm1` changes
* SN has not heard back on the relevant ethics so the issue has been closed for now
* RC has identified a slowdown in both phylopipe and datapipe
* To mitigate, RC has parallelised the datapipe alignment bottleneck
* Initially suspected that for the other processes there might be a resource problem on SLURM but after investigation the publish steps appear to fail due to memory limitations and need to be retried, adding to runtime
* RC has also identified the slowest step of phylopipe and plans to parallelise it but needs to get a good `cProfile` report first, VH has offered to help on this
* Long running time of datapipe is causing errors on Wednesdays as the GISAID datapipe is interfering with the COG datapipe

### Pipeline updates

* Head node blipped on Monday due to a disk error, RP has updated ceph to a new version offering bug fixes and stability improvements
* Elan has been running incredibly quickly since the `samtools depth` update


### Threats

* The ENA BAM pipeline has been running like garbage the past couple of weeks
* An issue in the SLURM controller causes some of the dehumanising jobs to drain the memory on the compute nodes (when they should be killed), reducing throughput
* Coupled with the reduction in CPU and RAM on the head node, the throughput for `kraken2` jobs has also reduced
* The reliability of the ENA API causes submissions to fail, causing expensive jobs to roll over to the following week
* https://github.com/COG-UK/dipi-group/issues/127, https://github.com/COG-UK/dipi-group/issues/128, https://github.com/COG-UK/dipi-group/issues/131
* RP plans to set up a VM for SN to use for the dehumanising pipeline without the need for SLURM

### AOB

* Discovered that Nextflow `-resume` will arbitrarily copy the last `sessionId` (regardless of whether the inputs and paths have changed)
* Has the effect that all `datapipe` runs are in the same session
* https://github.com/nextflow-io/nextflow/blob/73899f3b7967f3bd9bc393749a80ce893368a28e/modules/nextflow/src/main/groovy/nextflow/config/ConfigBuilder.groovy#L502
* Additionally, it appears the `clusterOptions` in the Nextflow config have the potential to override resource limits specified in the Nextflow workflow
* `clusterOptions` are added as the last `SBATCH` line when the job file is written; overriding any preceding `SBATCH` line
* Default `clusterOptions` on CLIMB adds `--time 300:0` which will override any `time` directive in the Nextflow process definition
* https://github.com/nextflow-io/nextflow/blob/f0ccfc7e83269bae6734fcafabf231a7c90e2087/modules/nextflow/src/main/groovy/nextflow/executor/SlurmExecutor.groovy#L86-L88