# Assembly of nanopore sequences

Jacobo de la Cuesta-Zuluaga, June 2025.

The aim of this notebook is to execute the `nf-core` pipeline `bacass` for the assembly of a bacterial genome sequenced using nanopore. You can find the pipeline documentation [here](https://nf-co.re/bacass/2.4.0/).

## Libraries

In [9]:
library(tidyverse)
library(conflicted)

In [10]:
conflicts_prefer(dplyr::filter)

[1m[22m[90m[conflicted][39m Removing existing preference.


[1m[22m[90m[conflicted][39m Will prefer [1m[34mdplyr[39m[22m::filter over any other package.


## Set paths

First, we'll set up the work directory and the paths where we'll save our files. Note that these are pretty much the same as in notebook 01.

In [11]:
# Directories
# Base directory
base_dir = "/mnt/lustre/groups/maier/maide581/projects/Huequito"

# Data
data_dir = file.path(base_dir, "data")

# fastq files
fastq_dir = file.path(data_dir, "fastq_files")

# sheets dir
sheets_dir = file.path(data_dir, "sheets")
dir.create(sheets_dir)

# assembly dir
assembly_dir = file.path(data_dir, "assembly")
dir.create(assembly_dir)

# Databases
k2_db = "/mnt/lustre/groups/maier/databases/Kraken_Bracken/k2_standard_16gb/k2_standard_16gb_20240605.tar.gz"

# Software
conda_env = "nextflow"

“'/mnt/lustre/groups/maier/maide581/projects/Huequito/data/sheets' already exists”
“'/mnt/lustre/groups/maier/maide581/projects/Huequito/data/assembly' already exists”
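The "already exists" warnings above are harmless and simply mean the folders were created in an earlier run. If you prefer a quiet re-run, the sketch below shows one way to do it with base R. The folder under `tempdir()` is a hypothetical stand-in, not one of the project paths.

```r
# Hypothetical directory under tempdir(), used only for illustration
demo_dir = file.path(tempdir(), "data", "assembly")

# recursive = TRUE also creates missing parent folders;
# showWarnings = FALSE silences the "already exists" warning on re-runs
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

# Calling it a second time does nothing and prints no warning
dir.create(demo_dir, recursive = TRUE, showWarnings = FALSE)

dir.exists(demo_dir)
```

The same two arguments could be added to the `dir.create()` calls for `sheets_dir` and `assembly_dir` if the warnings bother you.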


**Note** that `Huequito`'s repository includes a Nextflow configuration file that increases the baseline computational resources used by the pipeline. If you want to use the default resource allocation, remove the `-c` argument from the `bacass` command. In most cases you won't need to change anything; this is just for your information.

In [12]:
# Custom config file
nextflow_config = file.path(base_dir, "config/nextflow.config")
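Before launching a long run, it can be worth confirming that the files the command will point to are actually there. The sketch below is only an illustration: it uses stand-in files under `tempdir()` (hypothetical names) rather than the real `nextflow_config` or `k2_db` paths.

```r
# Stand-in files under tempdir(); swap in nextflow_config, k2_db, etc.
demo_config = file.path(tempdir(), "nextflow.config")
file.create(demo_config)
demo_missing = file.path(tempdir(), "does_not_exist.tar.gz")

inputs = c(config = demo_config, kraken_db = demo_missing)

# Named logical vector: TRUE where the file is present on disk
inputs_found = setNames(file.exists(inputs), names(inputs))
inputs_found

# Names of the inputs that could not be found
names(inputs)[!inputs_found]
```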

## Prepare tables

`nf-core` pipelines require you to provide a table where the path of each sample to be processed is specified. You could do this manually, although it is better to have some code help you with that. The chunk below lists all the `fastq.gz` files in the sequences folder and creates a table with the necessary columns.

__Note__ that there are multiple columns with `NA`. This is because the assembly pipeline can use multiple read types as input, such as Illumina short reads. We will only use long reads, which is why `LongFastQ` is the only column with data in it.

__Also take into account__ that you have to give a name or `ID` to each sample, so you'll need to modify the table if you have more than one sample to assemble.

In [13]:
# Create sample sheets
# List files and only retain fastq files
# The dots are escaped so they match a literal "." rather than any character
Long_reads = list.files(fastq_dir, full.names = TRUE, pattern = "\\.fastq\\.gz$")

# Create samples table
reads_sheet = data.frame(LongFastQ = Long_reads) %>%
    mutate(ID = "S_dysgalactiae", 
           R1 = NA,
           R2 = NA,
           Fast5 = NA,
           GenomeSize = NA) %>%  
    relocate(ID) %>% 
    relocate(LongFastQ,.after = R2)

reads_sheet

| ID | R1 | R2 | LongFastQ | Fast5 | GenomeSize |
|---|---|---|---|---|---|
| `<chr>` | `<lgl>` | `<lgl>` | `<chr>` | `<lgl>` | `<lgl>` |
| S_dysgalactiae | NA | NA | /mnt/lustre/groups/maier/maide581/projects/Huequito/data/fastq_files/MMC234_202311.fastq.gz | NA | NA |
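If you have more than one sample, a fixed `ID` will not work because every row would get the same name. One possible approach, sketched here with base R and hypothetical file names, is to derive each `ID` from its file name. Whether file names make good sample IDs depends on how your files are named, so treat this as a starting point.

```r
# Hypothetical file names, used only for illustration
long_reads_demo = c("/path/to/fastq_files/sampleA.fastq.gz",
                    "/path/to/fastq_files/sampleB.fastq.gz")

# Strip the folder and the .fastq.gz extension to obtain one ID per file
sample_ids = sub("\\.fastq\\.gz$", "", basename(long_reads_demo))

# Same column layout as the single-sample sheet
reads_sheet_demo = data.frame(ID = sample_ids,
                              R1 = NA,
                              R2 = NA,
                              LongFastQ = long_reads_demo,
                              Fast5 = NA,
                              GenomeSize = NA)

# Stop early if two files would end up with the same ID
stopifnot(!anyDuplicated(reads_sheet_demo$ID))

reads_sheet_demo
```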


The chunk below saves the sample sheet as a tab-separated file, which will be used as input for the assembly pipeline.

In [14]:
# Write file
Sdysgalactiae_samplessheet = file.path(sheets_dir, "Sdysgalactiae_samples.tsv")

reads_sheet %>%
    write_tsv(Sdysgalactiae_samplessheet)
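A quick sanity check is to read the file back and confirm that the header is what you expect. The sketch below uses base R and a hypothetical sheet written to a temporary file, so it runs on its own; with the real file you would read `Sdysgalactiae_samplessheet` instead.

```r
# Hypothetical sheet written to a temporary file for illustration
demo_sheet = data.frame(ID = "sample_A", R1 = NA, R2 = NA,
                        LongFastQ = "/path/to/sample_A.fastq.gz",
                        Fast5 = NA, GenomeSize = NA)
demo_file = tempfile(fileext = ".tsv")
write.table(demo_sheet, demo_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read the file back and compare its header with the expected columns
sheet_check = read.delim(demo_file, stringsAsFactors = FALSE)
expected_cols = c("ID", "R1", "R2", "LongFastQ", "Fast5", "GenomeSize")
stopifnot(identical(colnames(sheet_check), expected_cols))

# Number of samples in the sheet
nrow(sheet_check)
```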

## Execute pipeline

The code below builds a template of the bash command that activates the conda environment, changes to the assembly directory, and runs the `bacass` pipeline with the required arguments. The names in doubled braces are placeholders that get filled in afterwards.


In [15]:
# Create command
bacass_cmd = str_glue(
  "conda activate {{conda_env}} && \\
  cd {{out_dir}} && \\
  nextflow run nf-core/bacass -r 2.3.1 \\
    -profile m3c \\
    --input {{samples_sheet}} \\
    --outdir {{assemblies_dir}} \\
    -c {{nextflow_config}} \\
    --kraken2db {{kraken_db}} \\
    --annotation_tool prokka \\
    --assembler unicycler \\
    --assembly_type long \\
    --polish_method medaka \\
    --skip_kmerfinder")
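The doubled braces in the template above are what make the two-step approach work: `str_glue()` treats `{{` and `}}` as escaped literal braces, so the first call leaves a single-brace placeholder in the text, and a second call can fill it in. A minimal sketch with a one-line template and a hypothetical path:

```r
library(stringr)

# First pass: doubled braces are escapes, so the placeholder survives as text
template_demo = str_glue("cd {{out_dir}}")
template_demo
# cd {out_dir}

# Second pass: the single-brace placeholder is now replaced by the value given
filled_demo = str_glue(template_demo, out_dir = "/path/to/assembly")
filled_demo
# cd /path/to/assembly
```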

Now we can replace the placeholders in the bacass command template with the actual paths and filenames defined above. Then, the chunk prints the full command for you to copy and run in your terminal.


In [16]:
assembly_cmd = str_glue(bacass_cmd,
                        conda_env = conda_env,
                        out_dir = assembly_dir,
                        samples_sheet = Sdysgalactiae_samplessheet,
                        assemblies_dir = assembly_dir,
                        kraken_db = k2_db)

assembly_cmd
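Instead of copying the printed command by hand, you could also save it to a small script file. This is only a sketch: the command text and the script location under `tempdir()` are hypothetical stand-ins, and with the real notebook you would pass `assembly_cmd` and a path of your choice.

```r
# Stand-in for assembly_cmd; hypothetical, shortened command text
demo_cmd = "nextflow run nf-core/bacass -r 2.3.1 --input samples.tsv --outdir results"

# Hypothetical script location under tempdir()
script_file = file.path(tempdir(), "run_bacass.sh")

# Write a shebang line followed by the command
writeLines(c("#!/usr/bin/env bash", demo_cmd), script_file)

# Show what was written
readLines(script_file)
```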