---
title: nf-core/mag troubleshooting
author: John Sundh
date: last-modified
format:
    confluence-html:
        code-fold: true
jupyter: python3
---

**Description**

This notebook describes my attempts to run the nf-core/mag pipeline for this project.

## Setting up nf-core/mag

Initially I looked into using the [nf-core/mag](https://nf-co.re/mag) Nextflow pipeline for this project.

I created a conda environment specifically for use with nf-core/mag:

```bash
export CONDARC="/proj/snic2020-5-486/nobackup/SMS-23-6668-micegut/.condarc"
mamba env create -f mag-env.yml -p envs/mag
mamba activate envs/mag
# Copy scripts to set and unset environment variables
mkdir -p "$CONDA_PREFIX/etc/conda/deactivate.d" "$CONDA_PREFIX/etc/conda/activate.d"
cp src/activate.d/env_vars.sh "$CONDA_PREFIX/etc/conda/activate.d/"
cp src/deactivate.d/env_vars.sh "$CONDA_PREFIX/etc/conda/deactivate.d/"
# Re-activate the environment
mamba activate envs/mag
```

However, after some trial runs I found that the pipeline is likely not suited for this analysis. This could also be due to my lack of detailed knowledge and experience with debugging Nextflow pipelines, but I nevertheless turned to other alternatives. See below for my attempts at troubleshooting.

### Troubleshooting

Some issues with the nf-core/mag pipeline become apparent when trying to use it on a large dataset with several co-assemblies. With the setup used here, 108 samples divided into 10 assembly groups, specifying `--binning_map_mode all` maps every sample against every assembly, which results in 108 x 10 = 1080 bam files. At ~3-4 G per bam file, that comes to roughly 3.2-4.2 TB of data.

If assemblies could be created and binned one at a time, the temporary bam files could be cleaned up after each group.
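
As a rough check of the numbers above, here is a minimal sketch of the storage estimate. The sample and group counts and the 3-4 G file size are taken from the text; the conversion of 1024 G per TB and the assumption that bam files from one group are removed before the next group starts are mine.

```python
# Back-of-the-envelope estimate of bam storage for `--binning_map_mode all`.
# Sample/group counts and the 3-4 G per-file size come from the text above;
# 1 TB = 1024 G is an assumption.

N_SAMPLES = 108
N_GROUPS = 10
BAM_SIZE_G = (3, 4)  # approximate lower/upper size of one bam file, in G
G_PER_TB = 1024  # assumed conversion factor


def bam_storage_g(n_samples, n_assemblies, size_g):
    """Storage in G when every sample is mapped against every assembly."""
    return n_samples * n_assemblies * size_g


n_bam_files = N_SAMPLES * N_GROUPS
total_tb = [bam_storage_g(N_SAMPLES, N_GROUPS, s) / G_PER_TB for s in BAM_SIZE_G]
# Peak usage if only one assembly group is mapped and binned at a time
per_group_g = [bam_storage_g(N_SAMPLES, 1, s) for s in BAM_SIZE_G]

print(f"bam files in total: {n_bam_files}")
print(f"all groups at once: {total_tb[0]:.1f}-{total_tb[1]:.1f} TB")
print(f"one group at a time: {per_group_g[0]}-{per_group_g[1]} G")
```

Under these assumptions, handling one group at a time would keep the peak bam storage at roughly a tenth of the total.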

#### Strategy 1

First I tried to run the full pipeline as a node job after a tip from Phil Ewels (nf-core developer). In theory that would allow the pipeline to use local node storage for the work directory, so that only the finished results would need to be copied to the project folder.

However, this resulted in failed runs with no apparent output in either the Nextflow log or the Slurm logs. There are also obvious caveats to this approach, as the work directory cannot be saved in intermediate states if the pipeline fails.

#### Strategy 2

Instead, I tried to use group-specific input files, with one sample list per assembly group. For example, the 'mock' sample list would contain all samples, but only samples in the 'mock' group would have a group assignment. The plan was to run the pipeline 10 times, once for each sample list.
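
The group-specific sample lists could be generated along these lines. This is only a sketch: the column names `sample` and `group`, the example sample and group names other than 'mock', and the choice to represent "unassigned" as an empty group field are all assumptions for illustration, not necessarily what the pipeline expects.

```python
import csv
import io

# Hypothetical, shortened sample list; the real one has 108 samples and
# additional columns (e.g. paths to read files) that would be passed through.
SAMPLE_LIST = """\
sample,group
S1,mock
S2,mock
S3,treatA
S4,treatB
"""


def group_specific_rows(rows, keep_group, group_col="group"):
    """Return copies of rows where only keep_group keeps its assignment."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row[group_col] != keep_group:
            row[group_col] = ""  # assumption: unassigned = empty group field
        out.append(row)
    return out


def write_csv(rows, handle):
    """Write a list of dict rows as CSV, taking the header from the first row."""
    writer = csv.DictWriter(
        handle, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_LIST)))
groups = sorted({row["group"] for row in rows})

# One sample list per assembly group, kept in memory here instead of on disk
sample_lists = {}
for group in groups:
    buffer = io.StringIO()
    write_csv(group_specific_rows(rows, group), buffer)
    sample_lists[group] = buffer.getvalue()

print(sample_lists["mock"])
```

In practice each entry of `sample_lists` would be written to its own file and passed to the pipeline with `--input`.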

However, it appears that the pipeline also tries to create an assembly for the unassigned group, which obviously fails.

#### Strategy 3

The next strategy was to first generate the co-assemblies using a params file where `--skip_binning` is set to true. That would run only the QC + Assembly steps, which would only require ~350 G of storage for the QC'd reads plus a few additional G for each assembly.

Then, using another params file where all samples are assigned to the already assembled group and specifying `--binning_map_mode all`, the hope was that the pipeline would use the assembly as-is but map all samples to it.
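
A minimal sketch of how the two sample lists for one group could be derived from a single master list. The column names and sample names are hypothetical, and the assumption that the assembly list contains only the samples belonging to the group is mine.

```python
import csv
import io

# Hypothetical, shortened sample list with assumed column names
SAMPLE_LIST = """\
sample,group
S1,mock
S2,mock
S3,treatA
"""


def assembly_rows(rows, group):
    """Rows for the assembly run: only samples that belong to group (assumption)."""
    return [dict(row) for row in rows if row["group"] == group]


def binning_rows(rows, group):
    """Rows for the binning run: all samples, every one assigned to group."""
    return [{**row, "group": group} for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_LIST)))
mock_assemble = assembly_rows(rows, "mock")
mock_binning = binning_rows(rows, "mock")

print([row["sample"] for row in mock_assemble])
print([(row["sample"], row["group"]) for row in mock_binning])
```

Note that in the binning list the group column no longer distinguishes the original 'mock' samples from the rest.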

For example, to run assembly for the 'mock' group:

```bash
nextflow run -c conf/custom.config -params-file conf/mag.assembly.yml \
    nf-core/mag -r 2.3.0 -resume -profile uppmax \
    --project snic2022-5-350 \
    --input data/sample_list.mock-assemble.csv
```

Then to run binning for the same group:

```bash
nextflow run -c conf/custom.config -params-file conf/mag.binning.yml \
    nf-core/mag -r 2.3.0 -resume -profile uppmax \
    --project snic2022-5-350 \
    --input data/sample_list.mock-binning.csv
```

However, this doesn't work either because the assembly is re-generated with all the samples.